ABSTRACT


In many important ways the design and implementation of
programming languages are hindered rather than helped by BNF. We
present an alternative meta-language based on the work of Pratt which
retains much of the effective power of BNF but is more convenient for
designer, implementer, and user alike. Its amenability to formal
treatment is demonstrated by a rigorous correctness proof of a simple
implementation.

I. INTRODUCTION


The design and implementation of programming languages is a complex problem
which must be addressed from at least four distinct viewpoints. These viewpoints reflect 
the different but interacting interests of the designer, implementer, user, and theoretician. 
We address specifically the kinds of problems evident in the following two scenarios: 


Scenario 1: The old dangling ELSE problem. 


An early ALGOL grammar in Backus-Naur Form (BNF) was ambiguous with
respect to nested IF-THEN-ELSE statements. This was noticed by implementers who
often adopted the fairly local solution of attaching an ELSE to the most recent
available THEN. Although BNF grammars were eventually discovered corresponding
to this resolution, the grammar for ALGOL was rewritten to simply forbid nested
conditionals [Naur 1963].


Scenario 2: A new, theoretically sound approach. 


This is a summary of advice given for the construction of deterministic parsers 
and translators in The Theory of Parsing, Translation and Compiling [Aho & Ullman
1972]. 


1) Write your grammar in BNF. 


2) Decide whether you want top-down or bottom-up parsing (top down is more 
flexible for translation). 


3a) If you choose top-down: apply known transformations to the grammar and
check the result for the LL(1) property. If successful, a reliable top-down 
parser may automatically be constructed which handles a general class of 
syntax directed translation. 


3b) If you choose bottom-up: attempt to modify the grammar to satisfy the 
SLR(1) or LALR(1) conditions. If successful, a bottom-up parser may be
likewise constructed. 


4) In both cases, especially bottom-up, apply known optimizing transformations
to the parsers to attain practical efficiency. 


In the first scenario BNF is being used as the formal reference tool, since it enables
precise syntactic description. It does not, however, reveal important properties (e.g.
ambiguity) which the language designer needs to consider. Further, the implementer must
work informally, since the grammar itself does not suggest efficient parsing techniques (see
the survey of various approaches in ALGOL 60 Implementation [Randell 1964]). Finally,
evidence indicates that the user may also be using informal syntactic models (see the
description of expression evaluation in Introduction to ALGOL [Baumann 1964]). This
situation precludes any serious attempt at formal verification.

A considerable amount of rigor has been obtained via the formal approach in the
second scenario, but Aho and Ullman acknowledge several shortcomings. Many grammars
cannot be made LL(1) and, even when they can, the resulting grammars are usually large
and awkward and thus unnatural for syntax directed translation rules. Formal techniques
do not exist for obtaining SLR or LALR grammars. Finally, in both cases nontrivial
changes to the original grammar usually require that the entire process be repeated.

A fundamental weakness with these approaches is that BNF is inappropriate as a
definitional meta-language; it is essentially based on theories of generative grammars. The 
practical demands of parsing and translating restrict us to certain “tractable” grammars, but 
such grammars are often very difficult to recognize. In addition these “tractable” grammars 
tend not to include the most convenient description of a language, so one usually ends up 
with several representations for the same language definition; e.g., a simple grammar for the
user, and a complicated one for the parser. Finally, it is often necessary to transform the 
grammar into a parse table and then into an optimized parse table. Such multiple 
representations form a severe obstacle to formal verification. 

What we would like, then, is a system which includes:


1) A natural and convenient definitional meta-language for the designer, 


2) A user oriented meta-language which makes any defined language easy to learn 
and use, 


3) A simple method for automatically constructing an efficient parser/translator for 
any defined language, and 


4) Enough precision in the above to permit formal proof that all components agree 
precisely. 


Pratt presents a system in "Top Down Operator Precedence” [Pratt 1973] which 
addresses the first three of these issues quite well. He allows the implementer to “write 
arbitrary programs” while offering “in place of the rigid structure of a BNF-oriented
meta-language a modicum of supporting software, and a set of guidelines on how to write 
modular, efficient, compact and comprehensible translators while preserving the impression 
that one is really writing a grammar rather than a program.” This approach has been 
followed in the construction of CGOL, a combination definitional meta-language and 
extensible programming language [Pratt 1974] which demonstrates the power and 
convenience inherent in this approach. 

The CGOL system, as presented, does not satisfy the fourth criterion; it lacks a 
complete formal context in which correctness may be stated and proven. In this paper we 
complete a formal context, present an example implementation, and rigorously prove its 
correctness. 

We believe that many of the difficulties mentioned above may be avoided by writing 
grammars in a meta-language whose descriptive power is tailored to fit the intended 
application. We present and analyze such a meta-language for CGOL type translation; the 
meta-language expresses a class of languages which are easily and naturally parsed. For an 
exact definition of the describable languages, we present a user-oriented model which 
describes how sentences may be generated from any grammar. 

Since the meta-language is designed to fit the parsing method, it is possible to 
construct an extremely simple parsing program which operates by simply reading a given 
grammar as data. We give a LISP implementation of this parser, designed primarily for 
clarity and ease of proof.

The correctness proof for the example parser is presented in two parts: theoretical
properties and a program proof. The theorems of the first part deal exclusively with
properties of the meta-language; these proofs are completely independent of the program
and the parsing algorithm. The use of these properties allows the actual program proof to
deal almost exclusively with argument passing and flow of control; the program proof is
tedious but straightforward.

Chapter II contains an introduction and analysis of the CGOL approach to parsing. 
Chapter III is an introduction and informal discussion of our system: the syntactic
meta-language, the generative model of defined languages, the parsing program, and
correctness criteria. Chapter IV covers the same material with complete formal definitions,
and Chapter V contains the correctness proof.


II. THE CGOL APPROACH


We begin with a presentation and analysis of the parsing/translating method 
proposed by Pratt; a motivation and detailed introduction may be found in "Top Down 
Operator Precedence” [Pratt 1973]. The discussion in this chapter centers on the parsing 
technique: how it works, what features yield unique advantages, and how it compares with 
known work in formal parsing theory. 


II.A The Algorithm


Pratt’s approach to translation (which we refer to as the CGOL approach, after its 
application in [Pratt 1974]), is specifically oriented toward the translation of expressions, 
where an expression is simply an operator (e.g., + or x) with its arguments. For those not
familiar with expression oriented programming languages, the analogy to arithmetic 
expressions is sufficient for the moment. Each operator of the defined language has 
associated with it a program which embodies most syntactic and semantic information for 
that operator. The programs, called denotations, are executed in a left to right scan by a 
simple, recursive algorithm; each denotation has the power to look at the next symbol in the 
input string, advance (but not back up) the current symbol pointer, and call the parsing 
algorithm recursively to scan another expression. The pointer to the input string is a global 
variable and may be advanced by any denotation. The denotation of a symbol may be 
called at two points in the algorithm: step 2 and step 4. Step 2 corresponds to the case 
where the operator is at the beginning of a string and does not take a left argument. Step 4 
assumes that the expression parsed so far is the left argument to the operator. 


PARSE is the function which is called to scan and translate an expression starting at the 
beginning of an input string. 


STEP 1: PARSE looks at the first symbol of its input string (it will never look farther ahead
than the current pointer to the string). Since this symbol occurs at the beginning of an
expression, it is assumed to be an operator which takes no argument on its left side
(constants and variables are treated as operators with no arguments). PARSE executes
the denotation associated with this symbol.


STEP 2: The denotation for the current operator moves the pointer rightward along the 
input string, when necessary to gather right arguments. The denotation returns the 
translation of this expression, leaving the input pointer at the symbol following the 
expression.


STEP 3: PARSE now has the translation of an expression starting at the beginning of the 
input string. The question is asked: should this expression be given as a left 
argument to the next operator in the string, or should it be returned (presumably as a 
right argument to the caller of PARSE)? The decision is made by comparing numerical 
binding powers associated with each operator; the next symbol must have a left 
binding power associated with it, and PARSE was given, as an argument, the right 
binding power of its caller. 


STEP 4a: If the right binding power of the caller is greater (or equal), the translation 
obtained so far is returned. RETURN. 


STEP 4b: If the left binding power of the next symbol is greater, then it is assumed to be
an operator, and the expression translated so far is its left argument. The translation
is passed as an argument to the execution of the denotation associated with this
symbol.


STEP 5: The denotation for this operator moves the pointer rightward along the input 
string, when necessary to gather right arguments. The denotation returns the 
translation of this expression, leaving the input pointer at the symbol following the 
expression. 


STEP 6: Iterate to step 3. 

We observe that the definitional information for each operator falls into four general
categories. In the first category we include the specification of the operator's left and right
binding powers; these integers are used to locate the right ends of expressions. The second
category simply indicates the presence or absence of a left argument. This feature belongs
in a separate category since the collection of a left argument is not directly controlled by a
denotation; i.e., when a denotation is executed, its left argument, if any, has already been
scanned and translated. Denotations for operators with a left argument are executed from
Step 4b, those without from Step 2. The third category includes a procedure for right
argument collection which may invoke a number of techniques, the most obvious of which
is the collection of an expression (argument) by recursively calling the parser. In addition,
parsing decisions may be made by looking one symbol ahead in the input string. The
fourth category includes a procedure for translation.
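
The steps of PARSE described above can be sketched as a short program. The following Python fragment is only an illustration under stated assumptions, not the LISP implementation given later in the paper: the binding-power table BINDING, the use of ^ as the exponentiation symbol, the tokenizer, and the tuple-shaped "translations" are all hypothetical, and the denotations are folded into the body of parse instead of being stored per operator.

```python
import re

# Hypothetical (left, right) binding powers; only their relative order matters.
# A right power equal to the left power gives left association; a right power
# below the left power (as for ^) gives right association.
BINDING = {"+": (10, 10), "*": (20, 20), "^": (30, 29)}


def tokenize(text):
    """Split the input into symbols and append an end marker."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*^()]", text) + ["<end>"]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0  # the input pointer: advanced, never backed up

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def left_bp(self, tok):
        # Symbols with no left-argument definition get left binding power 0.
        return BINDING[tok][0] if tok in BINDING else 0

    def parse(self, rbp=0):
        # Steps 1-2: the first symbol takes no left argument.
        tok = self.advance()
        if tok == "(":
            left = self.parse(0)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
        else:
            left = tok  # constants and variables: operators with no arguments
        # Steps 3-6: compare the next symbol's left binding power with the
        # caller's right binding power; return when the caller's is >=.
        while self.left_bp(self.peek()) > rbp:
            op = self.advance()
            right = self.parse(BINDING[op][1])  # gather the right argument
            left = (op, left, right)            # the "translation"
        return left
```

With these assumed binding powers, `Parser("a + b * c").parse()` groups the product first, `a + b + c` associates to the left, and `a ^ b ^ c` associates to the right.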


II.B Comparisons with Other Methods


From a theoretical standpoint a CGOL translator has unlimited syntactic power.
This is not, however, the primary issue; it is much more important to ask what it can do
well. We provide one answer to this question by comparing the algorithm to a number of
known parsing methods, showing how CGOL combines certain advantages of each. This
discussion presupposes some familiarity with formal parsing theory. The topics discussed
are:


1. Introduction and Example Grammars
2. The Parse Type
3. Skeletal Grammars, Ambiguity
4. Operator Languages, Precedence Parsing
5. Flow of Control
6. Combination Unary/Binary Operators


1. Introduction and Example Grammars 


The key to the effectiveness of the CGOL parser is the simple but powerful control
structure. The syntactic power of the parser is theoretically unlimited, since arbitrary
programs may be written as denotations; the control structure, however, creates an
environment in which a great many grammatical constructs may be handled very simply.
The language CGOL presented in [Pratt 1974] and the translator constructor in this paper
are examples. This flexibility and convenience result from a unique combination of parsing
techniques, most of them well known by themselves. Rather than asking to which
theoretical class CGOL belongs, we look for similarities between the operation of the
CGOL parser and those in known categories. CGOL combines techniques from many
different approaches.

We will refer to the following grammars in this discussion. They illustrate in a
simple way several of the issues relevant to parsing schemes. Grammar A is an ambiguous
grammar for the language of arithmetic expressions; A’ is a standard unambiguous version
in which + and * associate to the left and ↑ associates to the right. These properties result
from the use of single productions and left and right recursion. B is an ambiguous
grammar for IF-THEN-ELSE statements (the well-known dangling ELSE problem).
Grammar B’ is an unambiguous grammar for the same language, representing the usual
solution to the problem.


Grammar A
1 E → E + E
2 E → E * E
3 E → E ↑ E
4 E → ( E )
5 E → a

Grammar A’
1 E → E + T
2 E → T
3 T → T * F
4 T → F
5 F → P ↑ F
6 F → P
7 P → ( E )
8 P → a

Grammar B
1 S → if B then S
2 S → if B then S else S
3 S → c
4 B → b or B
5 B → b

Grammar B’
1 S → if B then S
2 S → if B then S’ else S
3 S → c
4 S’ → if B then S’ else S’
5 S’ → c
6 B → b or B
7 B → b


2. The Parse Type 


If we trace the operation of the CGOL parser, observing the order in which the 
components of the parse tree are recognized and assembled, we see that it is essentially 
producing a left corner (LC) parse. We begin the discussion of this observation with a 
brief look at top-down parsing. Parse types are categorized top-down, bottom-up, etc. 
according to the order in which they recognize the grammar rules used to derive the input 
sentence. An equivalent model is to imagine the derivation as a tree with the root 
nonterminal symbol at the top, and the leaves corresponding to the sentence. A top-down 
parser recreates this tree from the top downward, root nonterminal first. Stearns points out 
that this type of parser is especially useful for combined parser/translators; since each 
production is identified before its descendents in the tree, an implementation may 
conveniently use recursive descent. Translation rules may correspond to grammar rules, 
which may correspond to nested environments in the translating program. These ideas are 
discussed at length in [Knuth 1968] and [Lewis & Stearns 1968]. 


LL Languages 


The LL(k) grammars are those which can be naturally parsed deterministically (i.e.
without backtracking as the input is scanned) from left to right, top-down. The usual parser
associated with LL grammars is the predictive parser which looks ahead k symbols on the
input stream before deciding which production to recognize at any given point in the parse.
In addition to the general usefulness of top-down parsing, predictive parsers for LL(k)
grammars are very simple; they may be implemented on a one-state Deterministic Push
Down Automaton (DPDA) [Kurki-Suonio 1969]. Further, they are very efficient and handle
errors reasonably well [Aho & Ullman 1972].

The central problem with LL parsing is that very few grammars are LL(k). In fact,
very few languages have LL(k) grammars for any k; an example is grammar B’, which
generates a non-LL language. When languages do have LL(k) grammars, these are not
always the smallest or most natural descriptions of the language. For example, Stearns
discusses transformations which may convert grammars into LL(1) grammars at the expense
of added complexity [Stearns 1971]. Grammar A’ for arithmetic expressions is not an LL
grammar for any k because of left recursion (in rules like E → E + T). Left recursion may
be eliminated by converting a grammar to Greibach Normal Form (via a known algorithm). 
The GNF grammar for arithmetic expressions is essentially right associative, although the 
old grammar parse may always be recovered from a new grammar parse. Stearns presents 
optimizations which reduce the nonterminal explosion in the case of arithmetic expressions 
(in general the transformation squares the number of nonterminal symbols), but the result 
depends heavily on the fact that this is an operator precedence language. This property of 
arithmetic expression grammars (such as A’) allows a simpler treatment by the direct use of 
operator precedence (to be discussed below). 


Left Corner Parsing 


As mentioned above, LC parsing is a variant on top-down parsing. While a top-down
parser must recognize the occurrence of a rule before any of its descendants, an LC
parser does not until the leftmost descendant has been found. This leftmost descendant, the
leftmost symbol in the right part of the rule, is called the left corner. This corresponds
quite closely to the operation of the CGOL parser; each rule in CGOL corresponds to an
operator, and each operator is recognized (its denotation executed) as it is encountered in a
left to right scan. Since operators may have expressions occurring as left arguments, they
are recognized after their left corner. This parse method has been said to parse the left
corner of a rule bottom-up and the rest of the rule top-down. When the first symbol of a
rule is a terminal symbol, as with all NILFIX and PREFIX operators in CGOL, the
parser is operating essentially top-down.

Nondeterministic LC parsing has been used for some time [Irons 1961] [Cheatham 
1967], but only more recent work has examined deterministic LC parsing. Rosenkrantz and 
Lewis identify the LC(k) languages, those which have LC(k) grammars and can be parsed 
deterministically LC with k symbol lookahead [Rosenkrantz & Lewis 1970]. The class of
LC(k) languages is shown to be identical to the class of LL(k) languages via the result that 
the elimination of left recursion produces an LL(k) grammar if and only if the original 
grammar was LC(k). Thus LC(k) grammars give us no ultimate increase in expressive 
power, but they do offer a naturalness and economy of description in many cases. In an 
LC(k) translator this advantage is gained at the cost of some potential flexibility (since left 
corner nonterminals may not be parsed top-down). An important advantage is that one rule 
corresponds to one operator, and the semantics for a rule may be conveniently localized. 

Grammar A’ is LC(1), and thus a transformed version, without left recursion, is 
LL(1); in fact, this is nearly identical to the example transformed by Stearns in [Stearns 1971] 
where the number of nonterminals becomes squared under the transformation. Grammar 
B’, however, is not LL(k) for any k. In fact, it is intuitively clear that the language 
generated by B’ is not an LL language, since it is impossible to tell at the beginning of a
string which of two rules is to be applied; there can be no LL(k) or LC(k) grammar which 
generates the language. 


3. Skeletal Grammars 


While the CGOL parser traces a left corner parse and operates with lookahead 1, it is
not actually an LC parser as defined by Rosenkrantz and Lewis, since it uses no grammar
in the ordinary sense. There is only one nonterminal in the parser, the implicit one for an
expression. All expressions are treated the same. What we have then is more like the
grammar A, sometimes called a skeletal grammar. Skeletal grammars typically are
ambiguous, so external means need to be used to resolve any ambiguous sentences. The
CGOL parser resolves this ambiguity by a number of techniques sometimes seen in parser
implementations: linear operator precedence functions, flow of control decisions, and
two-state unary vs. binary operator recognition. Some of these techniques have been viewed
as optimizations to be used whenever a grammar is found with the right property, although
it is seldom obvious at a glance if this is the case. Techniques have even been developed to
transform grammars in the hope that the desirable properties might be obtained.

The CGOL approach is to avoid juggling context-free grammars at all. This is done
by not attempting to describe difficult matters with CFG rules. These rules are certainly
useful for describing phrase structure (as in the two ambiguous example grammars), but
begin to grow in size and lose clarity when they describe precedence hierarchies and
association (as in grammar A’).


4. Operator Languages, Precedence Parsing 


Some of the information which is normally represented by nonterminal symbols may
be defined as properties of the terminal symbols, if the languages are defined by operator
grammars. These are context-free grammars which have no adjacent nonterminal symbols.
Although these are something of a special case in the literature on formal languages, a great
many useful programming languages have (or are very close to having) operator grammars.
All four example grammars are operator grammars; see also [Floyd 1963] for an operator
grammar for ALGOL. In fact, it seems that adjacent nonterminals usually appear when we
try to solve some “problem” with a grammar (say ambiguity, or left recursion) by
transforming it into something less natural. Rules with no terminal symbols at all are
especially nonintuitive; we like to think of each rule as having some meaning, but when a
rule has no associated terminal symbols, its occurrence relative to a sentence will only be
implicit. In the CGOL parser each rule is attached to some symbol, an operator. With this
restriction CGOL is able to apply the following techniques.


Precedence Parsing 


The term precedence parsing describes a well known family of techniques used in
bottom-up parsing. The standard implementation of a bottom-up parse is known as a
shift-reduce algorithm. This algorithm scans the input, one symbol at a time, from left to
right. A shift step reads an input symbol and pushes it onto a stack. A reduce step occurs
when a sequence of symbols on the top of the stack corresponds to the right side of a
grammar production; this leftmost reducible phrase is called the “handle” of a sentential
form. This series of symbols is popped off the stack and is replaced by the nonterminal
symbol on the left of the rule. A parse is complete when the stack contains only the root
nonterminal of the language and the input stream is empty; the output is a bottom-up parse.

Precedence parsing methods are distinguished by the method of making the
shift-reduce decision, i.e. deciding if the scan has reached the right end of a handle. The
general technique is to derive from the grammar a relation (usually written ⋗) on the
symbols of the language. Although a variety of precedence techniques have been
developed, their essential feature is that they compare two adjacent symbols in a sentential
form; if the relationship ⋗ holds between them, the right end of a handle has been reached.


Operator Precedence 


The application of precedence techniques to operator languages leads to a well known 
and efficient parsing method (see [Floyd 1963]). Operator precedence grammars are those 
for which the shift-reduce decision may be made uniquely by considering only terminal 
symbols; i.e., the uppermost terminal symbol on the stack is compared with the next input 
symbol. Considerable storage space and algorithmic complexity is saved by simply ignoring 
nonterminal symbols; i.e., not using them to carry information. The resulting parse tree is 
called the skeletal parse, since all productions with single nonterminals on the right side are
missing. The interesting structure is there, though, since extra nonterminals with rules like
E → T in Grammar A’ are often included only to express properties like right or left
association and have no semantic implications. 

Although operator precedence seems a somewhat obscure property for a grammar to 
have, Floyd argues that many useful programming languages are quite close to having 
operator precedence grammars. He offers an ALGOL operator precedence grammar as an 
example and identifies certain problems which he suggests be solved via escape clauses, or
special parse techniques. It seems that the technique handles the majority of language
features quite well, but has certain difficulties which would be much better dealt with by
exception than forced into the basic scheme. CGOL deals with some of these problems quite
well. 

Pratt conjectures that operator precedence techniques are widely applicable because of
their intuitive appeal; they correspond exactly to the ordinary conventions for writing
arithmetic operators. Grammar A’ for example is an operator grammar in which the
relations ↑ ⋗ * ⋗ + hold. These represent the notion of the precedence hierarchy of these
operators. We also note that + ⋗ + and * ⋗ *, meaning that these two operators associate
to the left. On the other hand, the relation ↑ ⋖ ↑ holds; this means, in the operator
precedence scheme, that this operator associates to the right.


Linear Precedence Functions 


An optimization often considered for operator precedence schemes (and for
precedence relations in general) is the encoding of the precedence matrix (i.e. the relation)
via linear functions. Typically, two integer valued functions f and g are defined over
terminal symbols. If for two terminal symbols x and y the relation x ⋗ y holds, then it will
also be true that f(x) > g(y). While the technique only works for a small number of
possible matrices, it turns out to be easily applicable to grammars like A’. Again, the
conventional hierarchy of the operators in arithmetic expressions allows this encoding
scheme to work.
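
As a small worked illustration of this encoding, the Python sketch below checks one possible pair of functions f and g against the precedence relation for the three operators of Grammar A’. The particular integers, the use of ^ for ↑, and the helper name encodes are assumptions chosen for the example; they are not taken from the paper, and other integer assignments would serve equally well.

```python
# RELATION[(x, y)] is True when x ⋗ y (x on the stack takes precedence, so
# reduce) and False when x ⋖ y (shift). "^" stands for the exponent operator.
RELATION = {
    ("+", "+"): True,  ("+", "*"): False, ("+", "^"): False,
    ("*", "+"): True,  ("*", "*"): True,  ("*", "^"): False,
    ("^", "+"): True,  ("^", "*"): True,  ("^", "^"): False,
}

# Hypothetical linear precedence functions.
F = {"+": 2, "*": 4, "^": 5}  # f: applied to the left (stacked) symbol
G = {"+": 1, "*": 3, "^": 6}  # g: applied to the right (incoming) symbol


def encodes(f, g, relation):
    """True if f(x) > g(y) exactly when x ⋗ y, for every pair in relation."""
    return all((f[x] > g[y]) == holds for (x, y), holds in relation.items())
```

Here `encodes(F, G, RELATION)` evaluates to True: the nine entries of the matrix are replaced by six integers, and right association of the exponent operator is captured by f("^") < g("^").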

An operator precedence parser for arithmetic expressions is very compact and
efficient. CGOL makes use of the operator precedence technique, but without forcing the
designer to express his ideas in BNF first, only to have them transformed by algorithm into
what might essentially be the original idea. The designer simply defines left and right
binding powers for each operator.

We recall that left corner parsing treats the left argument to an operator in bottom-up 
mode, and the rest of the rule in top-down mode. It is in the bottom-up mode that this 
technique is used by CGOL. When PARSE has scanned a complete expression, a decision is 
made by binding powers. If the next token of the string wins the expression, then the 
expression becomes a left argument. If PARSE returns the expression, then the expression is 
the result of a top-down call from some higher level. The operation of CGOL for 
grammars composed of only arithmetic operators, like A, is exactly parallel to the operation
of the canonical strong LC machine of Rosenkrantz and Lewis [Rosenkrantz & Lewis 1970]. 
The nested environments of CGOL correspond to the stack of the LC machine. An LC 
stack entry may either be a single nonterminal symbol, corresponding to a call to PARSE 
which has not yet parsed an expression, or a pair of nonterminal symbols, corresponding to 
a call to ASSOC which already has a left argument (or left corner) parsed, waiting to be 
attached to something. 


5. Flow of Control 


A major difference between CGOL and the LC machine becomes clear when we 
consider grammar B’. This is an operator precedence grammar which is easily handled by 
traditional bottom-up methods, but it is not LL(k) for any k. By the result of Rosenkrantz 
and Lewis then it is also not LC(k) for any k. The CGOL parser handles this example 
with great ease, since the program for the operator IF can simply parse its THEN argument 
and then look one token ahead to see if it is ELSE. Both possibilities are treated by the 
same denotation, so we are using the equivalent of the ambiguous version, grammar B. As
with arithmetic expressions, CGOL uses an ambiguous grammar with a simple rule to 
resolve ambiguity; in this case it is simply to take the ELSE if it is there. Aho, Johnson, and 
Ullman treat this example in some detail, pointing out that this solution is a simple fix to 
the otherwise ambiguous top-down parsing table for grammar B
[Aho, Johnson, & Ullman 1973]. We have a situation where the top-down predictive parsing
technique works for cases which are outside of the normally defined LL boundaries. By 
allowing arbitrary programs as denotations, CGOL allows an operator to collect any right 
arguments in a very general top-down fashion. We might say that each operator has its 
own top-down predictive parser for the grammar of its right arguments. It is this feature 
which allows the use of regular expressions to specify annotation patterns within the 
meta-language defined in this paper. In fact the restrictions placed on the use of the 
regular operators make each annotation pattern the equivalent of a miniature LL(1) 
language, although the restrictions are in fact even stronger than LL(1).
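
The behavior of the IF denotation described here, parsing the THEN argument and then looking one token ahead for ELSE, can be sketched as follows. This Python fragment is an assumption-laden illustration rather than the paper's parser: the function name parse_statement, the representation of input as a list of words, and the restriction of conditions to a single symbol are all hypothetical simplifications.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    tok = tokens[pos]
    if tok == "if":
        cond = tokens[pos + 1]  # a single condition symbol, for brevity
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_part, pos = parse_statement(tokens, pos + 3)
        # One-token lookahead: simply take the ELSE if it is there.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_statement(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    return tok, pos + 1  # any other symbol is treated as a simple statement
```

For the input `if b then if b then c else c`, the inner call sees the ELSE first and consumes it, so the ELSE attaches to the most recent THEN without any change to the grammar.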


6. Combination Unary/Binary Operators 


The third technique used to resolve ambiguity in a CGOL parser is a solution to a
problem encountered by Floyd when he tried to write an operator precedence grammar for
ALGOL. Certain symbols of the language have two uses, and operator precedence by itself
cannot distinguish between them. The common example of this is the minus operator
which may be used either as unary or binary. CGOL allows this double definition in a
general form. Any operator may have two unrelated definitions if one of them has a left
argument and one does not. CGOL is in this sense a two-state machine, one state
corresponding to an immediate call to PARSE, when no left argument is present, and the
other to a call to ASSOC, when there is a left argument available. There is never any
ambiguity.
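
The two-state idea can be illustrated with the minus operator. In the Python sketch below, which definition of "-" applies is decided purely by whether a left argument is available at that point in the scan. The binding powers, the "neg" tag for the unary form, and the single function standing in for both PARSE and ASSOC are assumptions made for the example, not details from the paper.

```python
LEFT_BP = {"+": 10, "-": 10, "*": 20}  # hypothetical left binding powers
PREFIX_RBP = 30                        # hypothetical: unary minus binds tightly


def parse(tokens, rbp=0):
    """Consume symbols from the front of the list; return a nested tuple."""
    tok = tokens.pop(0)
    if tok == "-":
        # State 1: no left argument is present, so use the prefix definition.
        left = ("neg", parse(tokens, PREFIX_RBP))
    else:
        left = tok
    while tokens and LEFT_BP.get(tokens[0], 0) > rbp:
        # State 2: a left argument is available, so use the infix definition.
        op = tokens.pop(0)
        left = (op, left, parse(tokens, LEFT_BP[op]))
    return left
```

For example, `parse("a - - b".split())` treats the first minus as binary and the second as unary, with no lookahead beyond the current symbol.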




III. BASIC CONCEPTS

In this chapter we motivate and informally introduce the components of our language
system. The notions presented will be given full formal treatment in the following chapters.
We discuss first the meta-language, giving examples of its use. Since the meta-language is
nonstandard, we will present a generative model which determines the sentences of a
defined language. The chapter concludes with a brief discussion of the translator algorithm
and its correctness criteria.

III.A The Meta-Language

Our formal language system is based on a syntactic meta-language which: 


(a) restricts the syntactic power of the system in a way which permits rigorous proofs, 


(b) embodies the full power of the scheme in the sense that we want it to express 
anything which the parse/translation scheme handles naturally and efficiently, and 


(c) allows the automatic construction of simple translators. 


We recall from Chapter II that the translator uses four types of information for each 
operator in the defined language: 


(1) Left and right binding powers,
(2) Presence of left argument,
(3) Pattern of right and annotated arguments, and
(4) Translation rule.


In the original CGOL facility this information is specified by the designer in a varying
mixture of declarative and procedural modes. To facilitate uniform treatment, we will allow
exactly one type of meta-language statement, a production, which will contain all of the data
necessary to define a single operator. We restrict the syntactic power of the translator by
requiring that all syntactic information (parts 1-3 as listed above) be stated in a declarative
language, leaving only the translation rule in procedural form. This declarative segment
includes a template of argument positions (parts 2 and 3) and a specification of binding
powers (part 1). Thus we might write:


Ex. 1   ~ "+" ~ ,14,14;<denotation>


to define + as an operator of the language with left and right arguments. It has left and
right binding powers of 14, and <denotation> is a procedure which accepts as input the
translations of the arguments and calculates the translation of the entire phrase. To deal
with more general programming language features we allow productions like:


Ex. 2   "IF" ~ "THEN" ~ ("ELSE" ~ | Λ) ,6;<denotation>


which defines the standard conditional operator. This production includes the specification
of extra right arguments (in addition to the normal one with IF acting as a prefix
operator); we call "THEN" ~ ("ELSE" ~ | Λ) the annotation pattern of the operator
"IF". Here the alternation (or union) symbol | is used to specify a choice of two patterns,
one of which is the null string Λ. An even more powerful conditional may be specified by
the production:


Ex. 3   "IF" ~ "THEN" ~ ("ELSEIF" ~ "THEN" ~)* ("ELSE" ~ | Λ) ,6;<den>


which uses the star closure symbol * to indicate any number of occurrences. 

We write annotation patterns using regular expression notation (as in Examples 2 
and 3) because it is well known, quite general, and amenable to formal treatment. In an 
actual implementation one might extend this notation to include pattern operations 
expressible in terms of the basic notation. For example, we might introduce the brackets [
and ], and let [a] denote (a|Λ). We could then write simply:


Ex. 2'  "IF" ~ "THEN" ~ ["ELSE" ~] ,6;<denotation>


instead of Example 2. Another possibility might be < and > to mean + closure (one or 
more occurrences). Such extensions are not included here, since they do not affect the 
theoretical behavior of the meta-language. 

This meta-language is restricted enough to allow formal treatment (goal a above) and 
is general enough to exploit the power of the parsing scheme (goal b). The patterns, 
however, are too powerful for simple parsing (goal c); any of these patterns could 
theoretically be parsed, but not all of them easily or unambiguously. We solve this by 
restricting the class of permissible patterns to those within the power of a very simple 
parsing algorithm.

This matching algorithm for patterns (arguments on the right side of operators) is 
deterministic and never looks more than one symbol ahead in the input string. Our model 
of the algorithm is a person with one finger on the pattern, one finger on the input string, 
and almost no memory! It should always be clear what to do next; no backing up allowed. 
To put this differently, the user should always be able to understand the parsing method. 
To insure the correct operation of the parser, we adopt the following three rules. 

The first rule is that patterns joined by alternation not begin with the same symbol. 
Thus we disallow the pattern: 


Ex. 1'  "IF" ~ ("THEN" ~ | "THEN" ~ "ELSE" ~) ,6;<denotation>


as an alternative to Example 2. In fact we prefer the original form for the following reason:
an annotated argument should be identified by the name of the preceding symbol, not by its
position in a pattern. We intend that there be no difference between the two THEN
arguments specified in Example 1'.

The second rule solves a problem arising from the use of the symbol Λ in patterns.
Whenever the pattern Λ is an alternative, the scanner could "match" Λ and miss a non-null
matching symbol. This problem is solved by a fiat similar to the dangling ELSE solution.
The parser will always match as much of the input string as possible; the pattern Λ is always
the lowest priority choice.

The third rule prohibits certain other patterns which cannot be completely handled
by the parser. For example, we consider the production


Ex. 4   "FOO" ~ ("BAR" | Λ) "BAR" ,2;<denotation>


which describes two possible phrases; one has one occurrence of BAR, the other has two.
Because of the fiat above, our algorithm can only parse the second possibility correctly.
This is a local case of the dangling ELSE problem, and since it is detectable, we disallow
patterns in which it occurs. Informally, this rule restricts the use of patterns which give the
parser a choice whether to continue, based on the presence of some delimiter symbol like
ELSE. We will require that such a pattern not be concatenated on the left with a pattern
which can start with one of its delimiter symbols. In place of Example 4 we might use the
production


Ex. 4'  "FOO" ~ "BAR" ("BAR" | Λ) ,2;<denotation>


which matches the same phrases but can be parsed correctly.

While not immediately obvious, these restrictions are completely local to each
production and are intuitively motivated. Patterns which violate them can sometimes
be rewritten in an acceptable form, and the acceptable form often makes more sense. In
fact, the verification of these conditions is computationally quite simple and an interactive
definitional facility would have verification and debugging aids built in. These rules are
considerably simpler and more intuitively appealing than the LL and LR conditions.

On a global level, use of the meta-language is quite straightforward; the global
restrictions which do exist are very simple. Only one production may be given per operator,
although some symbols may be used for two different operators, one with a left argument
and one without (e.g., the binary and unary minus operators would be defined in two
separate productions). A symbol defined as an operator may also be used as a delimiter (in
annotation patterns) as long as its binding powers remain well defined, since the role of a
delimiter is passive. This sort of detail is trivially manageable by a definitional facility.

An important property of this meta-language is that a set of productions forms a 
complete language definition; no other information is necessary. It is precisely this extreme 
modularity which makes designing extensible languages convenient.


III.B User Model


Once we have a language definition, a set of productions, we want to offer the user a 
manual explaining how to use the defined language. We claim that the productions 
themselves are straightforward enough (and their syntactic interactions simple enough) to 
serve as the basic manual, once our generative model is understood. For precision and 
verification this model will be presented in formal terms. It should be understood, however, 
that the formalism is intended only to add rigor to intuition; intuition need not be bent in 
order to agree with formalism. Some of the assumptions on which the model is based are 
discussed in Top Down Operator Precedence [Pratt 73]. 

The operator is the basic definitional unit in these languages; appropriately, the
user's primitive concept is the relation "is an argument of". We carry this one step further
by specifying what kind of argument (what role it plays). Also, to allow more than one
argument of the same kind, we specify an ordering. It is then natural to represent
expressions as trees: nodes correspond to operators and subtrees correspond to arguments.
The branches are ordered and labelled to identify the argument: normal arithmetic
arguments are connected by branches labelled left or right, and annotated arguments are
labelled by the annotating token itself, the delimiter. This is very closely related to
McCarthy's abstract syntax [McCarthy 1963].

The purpose of syntactic convention is to uniquely represent these expression trees as 
linear strings of symbols. Two well known examples are the use of postfix and prefix 
notation to represent ordered trees. In the domain of binary trees infix notation is 
commonly used, but here additional conventions are necessary to resolve the association of 
intervening arguments. An example of this problem is the string a+b*c, where we know by
convention that b is the left argument of the operator * and not the right argument of +.
The convention used here is usually viewed as a hierarchy of the arithmetic operators in
which the higher operators "go first" or "take precedence" over lower operators. We use this
convention to recover the correct tree from a given string; it may also be used to determine 
which trees are directly expressible as strings, and which trees require the use of 
parentheses. 

The languages we define use a combination of notational conventions including 
infix. To deal with association problems we adopt a convention based on the idea of 
operator hierarchy. A binding power is a numerical value which represents the precedence 
level of an operator; thus an expression between two operators is understood to be an
argument of the operator with the higher binding power. This convention is generalized
somewhat by allowing separately specified left and right binding powers for each operator, 
allowing operators to behave asymmetrically. 

We incorporate this convention in a model for writing linear expressions from trees.
The basic rule for writing expressions is: don't use an expression e as a left (right)
argument to an operator op if the left (right) binding power of op is high enough to cause
any subexpression of e to associate incorrectly. We know that a*b may be used as an
argument to +, but a+b may not be used as an argument to *. Formally, we measure the
resistance (of each side) of an expression to false associations. We will define the r-index
(l-index) of an expression to be essentially the lowest right (left) binding power of any
internal operator exposed to the right (left) side of the expression. For example, the
r-index of a+b*c is equal to the right binding power of +, since an operator to the right
of this expression (say another *) might take b*c incorrectly as a left argument. The
l-index of the expression SIN a is ∞, since it is totally invulnerable to false associations
on the left. Although this model does not allow certain expression trees to be written, most
defined languages include a bracketing operator (like parentheses) which is semantically null
and creates an expression with l-index = r-index = ∞. Thus, (a+b) may be used as an
argument to *.

The only other way in which operators may syntactically interact results from the
generalized dangling ELSE problem. The expression IF a THEN b has the property that
an ELSE occurring immediately after b will cause the parser to continue collecting arguments
for this phrase (recall the fiat: given the choice of continuing or not, the parser will always
continue). The informal rule is: don't follow an expression e by a delimiter which will get
incorrectly included with e (or some subexpression of e). This rule prohibits the use of an
IF-THEN expression as the second argument (i.e., the THEN argument) to an
IF-THEN-ELSE expression. We formalize this rule by defining the c-set for each
expression, the set of tokens which would cause argument collection to continue incorrectly
at some level. We say: an expression may not be followed by a token in its c-set.

These three properties (l-index, r-index, and c-set) completely describe the 
syntactic behavior of any expression. A standard BNF grammar would represent the same
information implicitly by the use of one nonterminal symbol. More closely related
techniques have been studied which attach various modifiers to nonterminal symbols in
context-free grammars; see especially "Indexed Grammars" [Aho 1968] and the
transformation defined on well-chained grammars in [Stearns 1971]. The CGOL approach
is extreme in the sense that nonterminal symbols play virtually no role at all.

The separate treatment of syntactic properties is an important feature of this 
approach; both designer and user can deal with the various syntactic issues explicitly and 
separately. The most prominent syntactic feature of a language is its basic phrase structure, 
expressed by the productions as an ambiguous context-free grammar with one nonterminal 
symbol (called "expression"). Argument association is dealt with separately by binding
powers, similar to the arithmetic conventions. Pratt argues that binding powers may be
usefully assigned on the basis of an implicit hierarchy of data types, corresponding closely
to ordinary intuition and conventions for programming languages. The annotation patterns
are also treated separately. Delimiters like ELSE which can cause problems can be explicitly 
noted (an easily computable property) and the operator combinations which interact can be 
listed. For example, it would be observed that IF-THEN-ELSE expressions interact with 
themselves if improperly nested. In a well designed language, these interactions will be 
rather limited in number, freeing the user from this concern in most cases. 


III.C Automatic Parsing


Our meta-language defines a class of programming languages for which the CGOL 
translation technique is particularly appropriate. We demonstrate by presenting a simple 
parsing program which, when given a set of productions as data, correctly parses sentences 
of the defined language and can be easily extended to handle translation via denotation 
programs. The program is a working (although inefficient) LISP implementation which 
requires the transformation of productions into a suitable LISP representation. 

A definitional facility would be a set of programs to provide this and other services 
to the designer. The meta-language processor is a program which accepts productions of the 
meta-language, either incrementally or in batches, and stores the information. In this 
implementation the data are simply attached to the name of the operator being defined (via 
the property list). A facility could also include automatic verification of annotation patterns 
with debugging advice, and automatic documentation. 

Incremental implementations would be convenient and could even be performed 
on-line. An extreme example is a bootstrap, in which denotations may be written in the 
language defined so far (e.g. the language CGOL [Pratt 1974]).


III.D Correctness


We consider a formal proof of correctness an essential, practical component of the
system; it is pointless to have automatic parsing without a guarantee that no mistakes will be
[...] proper representation, the parser works correctly.

To say that the parser works correctly requires a precise definition of what it should
do. Our specifications of a meta-language and user model provide a formal context in
which correctness may be rigorously defined.

We say that a parser operating on some language definition is correct when the
following are true:


I. If the expression (i.e. tree) e is written according to convention as a string,
then the parser will recover the tree e from that string.


II. If the parser recovers a tree e, then the input string is in the defined language.


Part I guarantees that any valid string of the language will parse correctly; part II
assures that no incorrect strings will be parsed.

The correctness theorem is actually a statement about the behavior of two functions:
writing (mapping trees into strings) and parsing (mapping strings into trees). Both parts of
the theorem are proven by induction, but over different domains: part I over the domain
of trees, and part II over strings. It is a corollary of the theorem that the languages defined
are unambiguous; i.e. no string can be written from more than one tree.

From the standpoint of formal language theory, the theorem is a proof of
equivalence of two alternate language definition mechanisms. A generative description is
presented as the user model; an analytic description is implicit in the parsing program.

The proof itself is carried out in two phases. In the first, we prove a number of
theoretical properties of the language class, i.e. of the definitional mechanism. These
properties are independent of any program or parsing algorithm. Given these results, the
actual proof of the parsing program is tedious, but quite straightforward.


IV. FORMAL DEFINITIONS 


In this chapter we present the formal details of the language system introduced in 
Chapter III. Section IV.A presents the meta-language; the parsing program for defined 
languages is given in Section IV.C. The generative model of defined languages, given in 
Section IV.B, permits a formal statement of parser correctness, discussed and proven in the 
next chapter. 


IV.A The Meta-language 
We begin by naming the basic lexical units of our defined languages. 
Definition: A token is a single lexical symbol in a defined language.


Notation: 
(i) Actual tokens will be represented using only upper case letters; e.g. IF, ELSE,
+, and (.
(ii) Lower case letters are used for meta-variables in this discussion; e.g. t (possibly
subscripted) refers to some token.
(iii) Greek letters represent strings of tokens; e.g. α, β, γ.


While the token is a lexical unit, the operator is our basic definitional unit. 


Definition: An operator is a set of semantic and syntactic information, representing some 
operation. We use the meta-variable op for operators. 


Productions 
An important feature of this system is that all specifications necessary to define a
programming language are in the form of operator definitions. A single operator definition
is expressed in a meta-language statement called a production; productions are the only
statements in the meta-language.


Definition: A production is a cluster of information which defines an operator and
associates it with a token of the defined language. A production defining the
operator op for the token OP must be in one of four forms depending on the operator
type.

    OPERATOR TYPE    PRODUCTION
    NILFIX           "OP" <p> ,<rbp>;<denotation>
    PREFIX           "OP" ~ <p> ,<rbp>;<denotation>
    POSTFIX          ~ "OP" <p> ,<lbp>,<rbp>;<denotation>
    INFIX            ~ "OP" ~ <p> ,<lbp>,<rbp>;<denotation>

where:

1) quotes (") are meta-language symbols enclosing the token being defined.
2) ~ is a meta-language symbol denoting the presence of an argument.
3) <p> is an optional annotation pattern, defined in the next paragraph.
4) <lbp> and <rbp> are left and right binding powers, non-negative integers.
5) <denotation> is a program which calculates the translation of op, given
the translations of its arguments.


Notation: When an operator op has been defined we refer to the components of the
production as follows:

type[op] is one of {NILFIX, PREFIX, POSTFIX, INFIX}.
p[op] is the annotation pattern defined for op.
lbp[op] is the left binding power defined for op, if any.
rbp[op] is the right binding power defined for op.
den[op] is the denotation defined for op.


Aside from patterns, what we have is a simple formalism in which ordinary unary and 
binary arithmetic operators may be defined. The first part of each production is a template 
in which the defined operator is quoted and the symbol ~ is a place holder for arguments. 
The left and right binding powers are stated separately, and the denotation incorporates a 
translation rule. We recall the production in Example 1 of Section III.A in which the
operator + is defined:


Ex. 1   ~ "+" ~ ,14,14;<denotation>


In this case type[+] = INFIX, and lbp[+] = rbp[+] = 14.
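
As a rough illustration, the components of a production can be held in a small record. The class Production, its field names, and the sample denotation below are assumptions made for this sketch; the text prescribes only the five components and the four operator types.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Production:
    """One meta-language production (hypothetical encoding)."""
    token: str                             # the quoted token "OP"
    left_arg: bool                         # a ~ before the token
    right_arg: bool                        # a ~ directly after the token
    rbp: int                               # right binding power
    lbp: Optional[int] = None              # left binding power, if any
    pattern: Optional[str] = None          # annotation pattern <p>, kept as text
    denotation: Optional[Callable] = None  # translation rule

    @property
    def type(self) -> str:
        """Operator type, read off the template of argument positions."""
        names = {(False, False): "NILFIX", (False, True): "PREFIX",
                 (True, False): "POSTFIX", (True, True): "INFIX"}
        return names[(self.left_arg, self.right_arg)]


# Example 1: ~ "+" ~ ,14,14;<denotation>, with an invented denotation.
plus = Production("+", left_arg=True, right_arg=True, lbp=14, rbp=14,
                  denotation=lambda a, b: ("PLUS", a, b))
```

The operator type is derived from the template rather than stored, which mirrors the way the four production forms differ only in where ~ appears.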

The optional use of annotation patterns is a distinguishing feature of this 
meta-language. A pattern allows an operator to take multiple right arguments, each labelled 
with an identifying token. In addition, tokens may be included which label no argument 
but play a purely syntactic role.


Definition: An annotation pattern, or simply pattern, is an expression specifying possible
labelled argument configurations. We use the meta-variables p, q, and r to
represent patterns. A pattern p must be in one of the following forms:


1. Λ
2. "d"       where d is a token
3. "d" ~     where d is a token, ~ a meta-symbol as above


or, inductively, for some annotation patterns q and r, and the
associated sets first_q, first_r and cont_q:


4. qr        if cont_q ∩ first_r = ∅
5. (q|r)     if first_q ∩ first_r = ∅
6. (q)*      if cont_q ∩ first_q = ∅


Definition: A delimiter is a token used in a pattern. We use the meta-variable d (possibly 
subscripted) to represent a delimiter. 


Before defining the sets first and cont, we refer briefly to Example 2 of Section III.A:

Ex. 2   "IF" ~ "THEN" ~ ("ELSE" ~ | Λ) ,6;<denotation>


In this example the operator IF is defined with type[op] = PREFIX, and we have the
pattern p[IF] = "THEN" ~ ("ELSE" ~ | Λ) (which will be seen to satisfy the
restrictions). As in the operator part of a production, quotes enclose the tokens, in this case
delimiters, of the defined language, and ~ holds the place of an argument.

With the exception of the restrictions imposed on cases 4, 5, and 6, these patterns are 
ordinary regular expressions with the usual interpretation; the symbols Λ, |, and * denote
the empty string, pattern alternation, and pattern star closure respectively. 

Although the symbol ~ is intended to hold the place of an argument (a
subexpression) we will expedite our discussion of patterns by considering a language in
which we include the symbol ~ to match itself. Thus, we will say that the string d matches
the pattern "d", and the string d ~ matches the pattern "d" ~. Two strings which match
p[IF] are THEN ~ and THEN ~ ELSE ~.


Notation: When the string w matches the pattern p, we write w<p. 


Recalling the restrictions imposed in our definition of annotation patterns, we now 
define the sets first and cont. We begin by defining our notion of first. 


Definition: first(w) = the first symbol of the string w (undefined if w = Λ).


Definition: first_p = ∪_{w<p, w≠Λ} {first(w)}.


The set first_p is simply the generalization of first(w) to all strings matching p. Similarly,
we have two forms of cont.

Definition: If w<p then cont_p(w) = ∪_{wβ<p, β≠Λ} {first(β)}.

Definition: cont_p = ∪_{w<p} cont_p(w).


The set cont_p(w) includes any symbol which may follow w in a longer string, when
both w and the longer string match p. In the context of finding a string to match a
particular pattern, this set has the following interpretation. Assume you are scanning a
string from left to right and have just reached the end of a string w which matches the
pattern p. If any of the symbols of cont_p(w) occur next in the string, then it may be
possible to continue scanning and find an extension of w which also matches p. Referring
again to Example 2, we have cont_{p[IF]}(THEN ~) = {ELSE} and
cont_{p[IF]}(THEN ~ ELSE ~) = ∅. The set cont_p is the generalization to include all tokens
which might occur this way, so we have cont_{p[IF]} = {ELSE}.


The sets first and cont enable us to state important restrictions on the use of 
patterns, restrictions which are directly motivated by our parsing algorithm for matching 
strings to patterns. While we cannot in general prevent non-local interaction of annotation
patterns (e.g., nested IF-THEN-ELSE expressions), it is possible to insure that there are no
ambiguities or unexpected results relative to a single pattern. The three restrictions prevent
any such problems.

The essence of the matching algorithm is as follows: look at the part of the pattern
remaining to be matched and decide what to do next. If the pattern is Λ, then simply stop.
If it is "d", then look at the next symbol in the input string. It must be d or there is an
error. Likewise, "d" ~ means to check for d and afterwards "collect an argument". When
the pattern is the alternation (q|r), there is the obvious problem of choosing which
pattern to use. The decision is made by examining the next symbol and determining
whether it is in the set first_q or first_r. When the pattern is qr, the two patterns are simply
matched in order. Finally, when the pattern is (q)*, the next symbol is always checked for
membership in first_q. If true, the pattern q is matched and the process repeated.

The restrictions on patterns insure that the choices made by this method are always
unique; i.e. that they are the only possible choices. Thus, in the case of (q|r) we require
that first_q ∩ first_r = ∅; no symbol may be in both sets. The problem with qr is slightly
more subtle; the restriction here (cont_q ∩ first_r = ∅) insures that the choice, whether to
continue matching a longer string to q, or to stop and begin matching r, is always unique.
The * operator is essentially an extension of concatenation, so the restriction on the pattern
(q)* is similar. It must always be clear whether to continue matching an instance of q, or
to go on to the next, so we require that cont_q ∩ first_q = ∅. Important properties of these
restrictions, independent of any parsing algorithm, are proven in Section V.B.


Sets of Productions 

We have now defined the local properties of a meta-language production; there is no 
other form of definitional information. A complete language definition is any set of 
productions, defining a set of operators, which satisfies minor global restrictions (to insure 
that all properties are well-defined). 


Definition: An operator is of type NUL-TYPE if it is defined without a left argument.
An operator is of type LEF-TYPE if it is defined with a left argument.




Definition: A language definition D is a set of productions in which:
1) no token OP has more than one NUL-TYPE production,
2) no token OP has more than one LEF-TYPE production, and
3) no token is both a LEF-TYPE operator and a delimiter.


Conditions 1 and 2 allow a token to represent two operators in the special case where
one operator takes a left argument and the other does not; i.e., when there will be no
ambiguity. In this case, two separate operations are actually being defined, but they are
represented by the same symbol. Such a token is both LEF-TYPE and NUL-TYPE. Context,
i.e. the presence of the left argument, will always make it clear which operator is meant.
Condition 3 guarantees that the left binding power of every token is well defined, since the
parser uses the convention that delimiters have lbp = 0.
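
The three global conditions are simple enough to check in a few lines. In the sketch below each production is reduced to a triple (token, has_left, delimiters); this encoding and the function name check_definition are assumptions of the example.

```python
from collections import Counter


def check_definition(productions):
    """Return a list of violated conditions for a set of productions.

    productions: list of (token, has_left, delimiters) triples, where
    has_left marks a LEF-TYPE production and delimiters is the set of
    tokens used in its annotation pattern (hypothetical encoding).
    """
    problems = []
    # Conditions 1 and 2: at most one NUL-TYPE and one LEF-TYPE per token.
    counts = Counter((tok, has_left) for tok, has_left, _ in productions)
    for (tok, has_left), n in counts.items():
        if n > 1:
            kind = "LEF-TYPE" if has_left else "NUL-TYPE"
            problems.append(f"{tok}: more than one {kind} production")
    # Condition 3: no LEF-TYPE operator may also serve as a delimiter.
    delimiters = set().union(*(d for _, _, d in productions))
    for tok, has_left, _ in productions:
        if has_left and tok in delimiters:
            problems.append(f"{tok}: LEF-TYPE operator used as a delimiter")
    return problems


OK = [("-", True, set()), ("-", False, set()),
      ("IF", False, {"THEN", "ELSE"})]
BAD = OK + [("ELSE", True, set())]
```

`check_definition(OK)` returns an empty list, since binary and unary minus differ in the presence of a left argument, while `check_definition(BAD)` flags ELSE, whose left binding power would otherwise be defined twice.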


IV.B Generative Model 


We have presented in Section IV.A the structure of our meta-language. A generative 
model is now defined which determines the correspondence between a language definition D 
(a set of productions) and the languages defined by D (a set of token strings). The model is 
closely related to the assumptions on which the CGOL approach to translator writing is 
based: the argument relationship among operators and the syntactic conventions for linear 
representation are related but separate issues. 

We begin with the set E_D of abstract expressions, collections of operators with
specified argument relationships. We then define three properties of expressions which
measure potential for syntactic interaction. Given these properties, we define the subset
E'_D ⊆ E_D of expressions which are grammatical, i.e., may be unambiguously represented as
linear strings of tokens. The process of linear representation is defined as the function W_D,
mapping expressions into the set Σ* of strings of tokens.


Expression Trees 


Our basic notion of abstract expression is based on the relationship "is an argument
of" among operators. This notion is extended by ordering and labelling each instance of
the relationship, identifying the particular role being played by the argument. Thus, an
instance of the relationship might be "is a left argument of" or "is an ELSE argument of".

Our formal model of these expressions is a set of ordered trees with labels on both
nodes and branches. A node corresponds to an operator whose arguments (subtrees attached
by ordered, labelled branches) occur in a configuration appropriate to the definition of the
operator. Examples of these trees are given in Figure 1. Figure 1a is an expression tree
containing only arithmetic operators. Arguments here are labelled "left" and "right",
indicating their roles. Figure 1b shows a conditional expression in which the test is the
"right" argument and the alternative values are appropriately labelled. Figures 1c and 1d
illustrate possible uses of delimiters which label no arguments. In these cases the tokens )
and FI are included to signal the end of the expression. We formalize this latter technique
by permitting labelled branches which connect to the null subtree, although we will not
include the null tree as part of our set.

We now define formally the set of expressions corresponding to a meta-language 
definition. Our basic requirement is that the argument configuration for each operator be 
appropriate to its definition. This requires a more precise definition of the correspondence 
between patterns and sequences of subtrees.


Figure 1. Expression trees


Definition: The ordered subtrees e_1,...,e_n (n≥0), labelled d_1,...,d_n, match the
pattern p iff one of the following is true:


1. p = Λ and n=0, i.e. there are no subtrees.

2. p = "d" and n=1, where e_1 is null and d=d_1.

3. p = "d" ~ and n=1, where e_1 is non-null and d=d_1.

or, where q and r are patterns, one of the following:


4. p = qr and ∃k, 0≤k≤n, such that e_1,...,e_k match q, and
e_{k+1},...,e_n match r.

5. p = (q|r) and e_1,...,e_n match q or r.

6. p = (q)* and either n=0 or ∃k, 0<k≤n, such that e_1,...,e_k match
q, and e_{k+1},...,e_n match (q)*.


We are now ready to define our complete set of expression trees. 


Definition: The set of expression trees E_D corresponding to a language definition D
contains the set of finite trees defined inductively by:


Basis: e∈E_D, where e is a single node with no branches attached, iff the node
has a label op such that op is defined in D, type[op]=NILFIX, and
Λ<p[op].


Induction: e∈E_D, where e is a tree with subtrees attached by labelled
branches, iff the root node has a label op such that op is defined in D,
each non-null subtree is in E_D, and one of the following cases holds:


1. type[op]=NILFIX and e has subtrees e_1,...,e_n (n≥0), labelled
d_1,...,d_n, which match p[op].


2. type[op]=PREFIX and e has subtrees e_0,e_1,...,e_n (n≥0),
labelled right,d_1,...,d_n, where e_1,...,e_n match p[op].


3. type[op]=POSTFIX and e has subtrees e_{-1},e_1,...,e_n (n≥0),
labelled left,d_1,...,d_n, where e_1,...,e_n match p[op].


4. type[op]=INFIX and e has subtrees e_{-1},e_0,e_1,...,e_n (n≥0),
labelled left,right,d_1,...,d_n, where e_1,...,e_n match
p[op].


Syntactic Properties of Expressions


Having defined our abstract domain of expressions, we now apply syntactic 
conventions. We claimed in the previous chapter that there are only two basic types of 
syntactic interaction possible among expressions in linear form. We define three properties 
of expressions (r-index, l-index, and c-set) which explicitly measure the tendency for 
an expression to participate in such interactions. 

The first and most common form of syntactic interaction is the association of 
intervening subexpressions (arguments). For example, in the expression a+b*c there is a 
choice, governed by convention, for the association of the subexpression b; it may either 
associate to the left (and become an argument to +) or to the right (as an argument to *). 
Since operators are subject to this interaction on either side (and binding powers may differ 
from right to left), we define two corresponding properties, beginning with the left. 


Definition: If e ∈ E_D then l-index(e) is defined inductively as follows: 
    Basis: If e is a node with no branches attached, then 
        l-index(e) = ∞. 
    Induction: If e has subtrees then let op be the label of the root node: 
        a) if op is of type LEF-TYPE (i.e. if it has a left argument), then 
            l-index(e) = min[lbp[op], l-index(e_{-1})], 
        b) otherwise (if no left argument) 
            l-index(e) = ∞. 

The value of l-index is a numerical measurement of an expression's resistance to 
false association to the left. If an operator has no left argument at all, then there can never 
be an intervening expression on the left, so there can never be a problem. In this case, 
l-index is ∞. Since a left argument may itself be an expression with a left argument, and 
so on, this property is defined inductively over all such subexpressions. An expression's 
resistance is only as high as that of the weakest "exposed" operator. 

For example, if e is the expression tree shown in Figure 1a, there are two operators 
exposed to the left, + and *. By definition we know that: 

    l-index[e] = min[lbp[+], lbp[*], lbp[A]], 

but since lbp[A] = ∞ this is equal to min[lbp[+], lbp[*]]. From this we understand 
that we have two subexpressions, A and A*B, which might be falsely associated to the left. 
In an expression tree these exposed operators are those which may be reached from the top 
by following branches labelled left down the tree. 

The situation on the right side of expressions is analogous although complicated 
slightly by the presence of multiple right arguments. 

Definition: If e ∈ E_D then r-index(e) is defined inductively as follows: 
    Basis: If e is a node with no branches attached, then 
        r-index(e) = ∞. 
    Induction: If e has subtrees, then let op be the label of the root node: 
        a) if there is a subtree e_n, and if it is non-null, then 
            r-index(e) = min[rbp[op], r-index(e_n)], 
        b) otherwise 
            r-index(e) = ∞. 

The value of r-index is analogous to l-index except that we now refer to e_n 
instead of e_{-1}. When there are subtrees e_1,...,e_n which match the pattern p[op] (i.e., 
when n>0) then e_n is simply the last (rightmost) one. It is this subexpression which, if 
non-null, is exposed to the right and is subject to false association. For example, both 1 
and X+1 are exposed to the right in the expression of Figure 1b. If e_n is null, then its label 
d_n is being used as a purely syntactic token to indicate that there are no more arguments to 
the right. In this case there is no possibility of false association, so the value of r-index 
is ∞. Examples of this are the expressions of Figures 1c and 1d. Now if the annotation part 
of the expression is entirely null (i.e. n=0), then the expression is of the ordinary arithmetic 
variety (e.g., Figure 1a). In this case, e_n refers to the right argument e_0, if there is one, 
and r-index is the exact counterpart of l-index. 
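The two index definitions can be read as short recursive functions. The sketch below is illustrative only: the `Node` class, its field names, and the binding-power values in `LBP` and `RBP` are assumptions introduced here, not values taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str
    left: Optional["Node"] = None    # e_{-1}, the left argument (if any)
    right: Optional["Node"] = None   # e_0, the right argument (if any)
    args: tuple = ()                 # ((d_1, e_1), ..., (d_n, e_n)); e_i is None when null


# Hypothetical binding powers, chosen only for this example.
LBP = {"+": 10, "*": 20}
RBP = {"+": 10, "*": 20, "IF": 5, "(": 0}


def l_index(e):
    """Resistance to false association on the left."""
    if e.left is None:               # basis, or operator without a left argument
        return math.inf
    return min(LBP[e.op], l_index(e.left))


def r_index(e):
    """Resistance to false association on the right."""
    # e_n is the last annotation subtree when n > 0, else the right argument e_0.
    last = e.args[-1][1] if e.args else e.right
    if last is None:                 # no exposed subexpression on the right
        return math.inf
    return min(RBP[e.op], r_index(last))


# A tree for A*B+C, a parenthesised X+1, and IF a THEN X+1.
sum_tree = Node("+", Node("*", Node("A"), Node("B")), Node("C"))
paren = Node("(", right=Node("+", Node("X"), Node("1")), args=((")", None),))
if_then = Node("IF", right=Node("a"),
               args=(("THEN", Node("+", Node("X"), Node("1"))),))
```

Under these assumed binding powers, the parenthesised tree has both indices equal to infinity because its last branch is a null subtree labelled `)`, whereas the IF-THEN tree leaves its THEN argument exposed on the right.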

We turn now to the second type of syntactic interaction, the generalized dangling 
ELSE problem. We recall that our pattern matching algorithm (i.e. for collecting right 
arguments) will continue to gather arguments as long as possible. We are interested in the 
case where the pattern p has been matched (say by the string w) and there is a choice 
whether to continue. Any token for which this is possible is by definition in cont_p(w). 

Looking at our standard example where p[IF] = "THEN" ~ ("ELSE" ~ | Λ), we have: 

    cont_{p[IF]}(THEN ~) = {ELSE}. 

This tells us that if the operator IF has so far collected the token THEN and a following 
argument then the collection may stop, but if ELSE appears next in the input string, it will 
be included. When we deal with general expression trees, this problem can be caused either 
at the top level (by the pattern of the topmost operator) or at lower levels (in exposed 
rightmost arguments), so the property c-set is defined recursively, similar to r-index. 
The c-set of an expression is the set of all delimiters which would be incorrectly included 
if placed after the expression in linear form. 
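One way to experiment with cont is to compute it mechanically. The sketch below does so with Brzozowski-style derivatives of patterns, which is a technique swapped in here for illustration and not the method used in the paper; the tuple encoding of patterns, the helper pattern `NONE` (matching no string at all), and all function names are assumptions. It relies on the reading of cont_p(w) as the set of first symbols of non-empty strings β such that wβ matches p.

```python
LAMB = ("LAMB",)   # matches only the null string
NONE = ("NONE",)   # helper: matches no string (not a pattern of the meta-language)


def nullable(p):
    """True iff the null string matches p."""
    k = p[0]
    if k in ("LAMB", "STAR"):
        return True
    if k == "CONC":
        return nullable(p[1]) and nullable(p[2])
    if k == "UNION":
        return nullable(p[1]) or nullable(p[2])
    return False                      # TOK, ARG, NONE


def first(p):
    """Set of symbols that can begin a string matching p."""
    k = p[0]
    if k in ("TOK", "ARG"):
        return {p[1]}
    if k == "CONC":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if k == "UNION":
        return first(p[1]) | first(p[2])
    if k == "STAR":
        return first(p[1])
    return set()                      # LAMB, NONE


def deriv(p, sym):
    """Pattern matching exactly the strings b such that (sym b) matches p."""
    k = p[0]
    if k == "TOK":
        return LAMB if p[1] == sym else NONE
    if k == "ARG":                    # "d" ~ : after d, the symbol ~ must follow
        return ("TOK", "~") if p[1] == sym else NONE
    if k == "CONC":
        d = ("CONC", deriv(p[1], sym), p[2])
        return ("UNION", d, deriv(p[2], sym)) if nullable(p[1]) else d
    if k == "UNION":
        return ("UNION", deriv(p[1], sym), deriv(p[2], sym))
    if k == "STAR":
        return ("CONC", deriv(p[1], sym), p)
    return NONE                       # LAMB, NONE


def cont(p, w):
    """cont_p(w) for a string w (list of symbols) that matches p."""
    for sym in w:
        p = deriv(p, sym)
    return first(p)


# p[IF] = "THEN" ~ ("ELSE" ~ | Λ)
IF_PAT = ("CONC", ("ARG", "THEN"), ("UNION", ("ARG", "ELSE"), LAMB))
```

For the standard example, `cont(IF_PAT, ["THEN", "~"])` evaluates to `{"ELSE"}`, and once the ELSE part has also been collected the set is empty.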

This definition requires the property cont to be defined on an ordered set of subtrees, 
rather than on the strings of the original definition. The correspondence is quite 
straightforward: a null subtree e_i with branch labelled d_i corresponds to the single symbol 
d_i, and a non-null subtree e_i labelled d_i corresponds to the string d_i ~. As is proven in 
Lemma 11 of Section V.B, this translation does not affect the definition of cont; the symbol 
~ can never be in the set. 

Definition: If e ∈ E_D then c-set(e) is defined inductively as follows: 
    Basis: If e is a node with no branches attached, then 
    Induction: If e has subtrees, then let op be the label of the root node: 
        a) if there is a subtree e_n and if it is non-null, then 
            c-set(e) = cont_{p[op]}(e_1,...,e_n) ∪ c-set(e_n), 
        b) otherwise 
            c-set(e) = cont_{p[op]}(e_1,...,e_n). 


Grammatical Expressions 


We now use these three syntactic properties to restrict our set E_D of expressions by 
eliminating those which permit unwanted syntactic interactions. 


Definition: e ∈ E_D is grammatical iff one of the following is true: 
    Basis: e has no branches attached, or 
    Induction: e is a tree with root node labelled op and with subtrees satisfying: 
        0) each non-null subtree is grammatical, 
        1) r-index(e_{-1}) ≥ lbp[op], if there is a subtree e_{-1}, 
        2) rbp[op] < l-index(e_i), for 0≤i≤n, when e_i is non-null, 
        3) d_i ∉ c-set(e_{i-1}), for 1≤i≤n, when e_{i-1} is non-null. 

This definition allows us to build trees while watching for syntactic problems. The 
restrictions correspond to the informal rules described in Section III.B; each restriction may 
be understood as the prevention of unwanted syntactic interaction. Restriction 1 covers the 
use of an expression as a left argument; it ensures that the whole expression will be treated 
as the argument, not some exposed fragment. For example, this restriction would prevent 
the use of the expression in Fig. 1a as a left argument to the operator ↑, since the 
subexpression C would incorrectly become the left argument of ↑. Restriction 2 is the 
equivalent on the right side. Restriction 3 ensures that no delimiter will be improperly 
included with a subexpression; e.g., don't use an IF-THEN expression as the THEN argument 
to an IF-THEN-ELSE expression. 
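For trees without annotation arguments (n=0), restrictions 0 to 2 reduce to a comparison of binding powers and can be checked with a few lines of code. The following sketch covers only that special case and omits restriction 3; the `Node` class, the operator `^` (standing in for a tightly binding infix operator), and all binding-power values are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str
    left: Optional["Node"] = None    # e_{-1}
    right: Optional["Node"] = None   # e_0


# Hypothetical binding powers.
LBP = {"+": 10, "*": 20, "^": 30}
RBP = {"+": 10, "*": 20, "^": 29}


def l_index(e):
    return math.inf if e.left is None else min(LBP[e.op], l_index(e.left))


def r_index(e):
    return math.inf if e.right is None else min(RBP[e.op], r_index(e.right))


def grammatical(e):
    """Restrictions 0-2 for annotation-free trees (restriction 3 is not modelled)."""
    # 0) every subtree must itself be grammatical
    for sub in (e.left, e.right):
        if sub is not None and not grammatical(sub):
            return False
    # 1) the left argument must resist being split by op
    if e.left is not None and not r_index(e.left) >= LBP[e.op]:
        return False
    # 2) the right argument must resist association to the left
    if e.right is not None and not RBP[e.op] < l_index(e.right):
        return False
    return True


A, B, C, D = Node("A"), Node("B"), Node("C"), Node("D")
good = Node("+", A, Node("*", B, C))        # A + (B * C)
bad_right = Node("*", A, Node("+", B, C))   # A * (B + C) without brackets
bad_left = Node("^", Node("+", Node("*", A, B), C), D)   # (A*B+C) ^ D without brackets
```

Here `good` passes, while `bad_right` fails restriction 2 and `bad_left` fails restriction 1, which mirrors the example of using A*B+C as an unbracketed left argument.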


Definition: E'_D = {e ∈ E_D | e is grammatical}. 

Our defined language will be based on only the expression trees which are 
grammatical. Ungrammatical trees may be easily fixed by the addition of some operator 
with bracketing properties, typically the semantically null operator (. For example, the 
expression shown in Figure 1c would, given a reasonable definition, have lbp = rbp = 0 
and cont = ∅; i.e., it is syntactically secure. 


The Writing Function 


Now that we have eliminated syntactic problems from our set of expressions, we may 
use a trivial writing function. 


Definition: The writing function W_D is defined recursively on the set E_D as follows: 
    If e ∈ E_D then: W_D(e) = α OP β d_1 γ_1 ... d_n γ_n, where: 
        OP is the token naming the operator at the root node of e. 
        α = W_D(e_{-1}) if e_{-1} exists, Λ otherwise. 
        β = W_D(e_0) if e_0 exists, Λ otherwise. 
        d_i = the label on tree e_i for 1≤i≤n. 
        γ_i = W_D(e_i) for 1≤i≤n, when e_i is non-null, Λ otherwise. 

The linear representation of trees defined by W_D uses a very simple convention. An 
argument is preceded by its label, with two important exceptions: the labels left and 
right are implicitly represented by juxtaposition with the operator. 
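As a small illustration of the writing function, the sketch below flattens a tree into its list of tokens. The `Node` class and its field names are assumptions introduced for the example; the traversal order follows the definition (left argument, operator token, right argument, then each label followed by its subtree when that subtree is non-null).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str
    left: Optional["Node"] = None    # e_{-1}
    right: Optional["Node"] = None   # e_0
    args: tuple = ()                 # ((d_1, e_1), ..., (d_n, e_n)); e_i is None when null


def write(e):
    """Return the linear form of e as a list of tokens."""
    out = []
    if e.left is not None:           # alpha
        out += write(e.left)
    out.append(e.op)                 # OP
    if e.right is not None:          # beta
        out += write(e.right)
    for label, sub in e.args:        # d_i gamma_i
        out.append(label)
        if sub is not None:
            out += write(sub)
    return out


if_tree = Node("IF", right=Node("a"),
               args=(("THEN", Node("b")), ("ELSE", Node("c"))))
paren = Node("(", right=Node("+", Node("x"), Node("1")), args=((")", None),))
```

For example, `write(if_tree)` yields `["IF", "a", "THEN", "b", "ELSE", "c"]`, and the null subtree labelled `)` in `paren` contributes only its label.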


The Defined Language 


Finally, the defined language S_D is simply the linear form of the grammatical trees. 

Definition: Given a set D of productions, the defined language is S_D, where 

    S_D = W_D(E'_D). 


IV.C The Parsing Program 


We present the parsing program in two parts: in addition to the actual program 
(which we will view as a function from strings into expression trees) we give a specification 
of the internal representation required for meta-language productions. A program which 
automatically converts a meta-language production to this internal form is called the 
meta-language processor. 


The Meta-language Processor 


There is virtually no processing of the information given in the productions of the 
meta-language. It is simply broken into the natural categories, converted into a standard 
LISP representation, and attached to the property list of the defined token. The categories 
and their property list names are: 

    1. Type (e.g. INFIX)         NUL-TYP, LEF-TYP 
    2. Annotation pattern        NUL-PAT, LEF-PAT 
    3. Left binding power        LBP 
    4. Right binding power       NUL-RBP, LEF-RBP 
    5. Denotation                NUL-DEN, LEF-DEN 


Since it is possible to have two operators for the same token, one with a left argument and 
one without, the two sets of data will be separately named so they may coexist and be 
independently retrieved from the property lists. The one exception is the left binding 
power, since it is irrelevant for NUL-TYPE operators. Any token used as a delimiter, 
however, will have its left binding power set to 0. The denotation properties will not be 
used in this implementation, since it will only parse and not translate. 

The definitional information will be represented as LISP data in the following forms: 


Type: The NUL-TYP and LEF-TYP properties are simply the appropriate names. Thus 
NUL-TYP may be either NILFIX or PREFIX, and LEF-TYP may be either 
POSTFIX or INFIX. 

Left binding power: The property is a non-negative integer. 

Right binding power: The property is a non-negative integer. 

Annotation Pattern: The representation of a pattern p is the list repr[p] defined 
recursively by: 

    1. If p = Λ then repr[p] = (LAMB). 
    2. If p = "d" then repr[p] = (d), where d is the token. 
    3. If p = "d" ~ then repr[p] = (d ARG), where d is the token. 

or, if q and r are patterns, and repr[q] and repr[r] their representations: 

    4. If p = qr then repr[p] = (CONC repr[q] repr[r]). 
    5. If p = (q|r) then repr[p] = (UNION repr[q] repr[r]). 
    6. If p = (q)* then repr[p] = (STAR repr[q]). 


Since this information is on property lists, it is globally available to the parsing 
program; a request for one of these properties will have the same value independent of the 
particular environment from which it is made. For the purposes of proof, we give the 
following axioms which formally specify the operation of the meta-language processor. 

Axiom 1: If the token OP is defined in D as a nilfix operator, then 
    (a) (GET 'OP 'NUL-TYP) = NILFIX 
    (b) (GET 'OP 'NUL-PAT) = repr[p[op]] 
    (c) (GET 'OP 'NUL-RBP) = rbp[op] 

Axiom 2: If the token OP is defined in D as a prefix operator, then 
    (a) (GET 'OP 'NUL-TYP) = PREFIX 
    (b) (GET 'OP 'NUL-PAT) = repr[p[op]] 
    (c) (GET 'OP 'NUL-RBP) = rbp[op] 

Axiom 3: If the token OP is defined in D as a postfix operator, then 
    (a) (GET 'OP 'LEF-TYP) = POSTFIX 
    (b) (GET 'OP 'LEF-PAT) = repr[p[op]] 
    (c) (GET 'OP 'LEF-RBP) = rbp[op] 

Axiom 4: If the token OP is defined in D as an infix operator, then 
    (a) (GET 'OP 'LEF-TYP) = INFIX 
    (b) (GET 'OP 'LEF-PAT) = repr[p[op]] 
    (c) (GET 'OP 'LEF-RBP) = rbp[op] 

Axiom 5: If the token OP is used as a delimiter in any production in D, then 
    (GET 'OP 'LBP) = 0 


It may now be seen how our global restrictions on sets of productions ensure that all 
of these properties are well-defined. Properties NUL-TYP, NUL-PAT, NUL-RBP, and 
NUL-DEN can only be determined if a nilfix or prefix operator is defined for OP, but we 
only allow one such production per token. Similarly, LEF-TYP, LEF-PAT, LEF-RBP, and 
LEF-DEN are well-defined. LBP may only be determined if a postfix or infix operator is 
defined for OP (in which case only one such definition is allowed) or if it is used anywhere 
as a delimiter (in which case the LBP is 0, no matter how many times it is used). A token 
may not, however, be both. 


The Parsing Program 


We present below the LISP code for a straightforward parser implementation. The 
parser returns the expression tree in a simple list representation defined below; an extension 
to the full translator would have the arguments passed to the denotation, rather than being 
assembled into a list. 


Expression Tree: The representation of a tree e ∈ E_D is the recursively defined list: 
    repr[e] = (OP r_left r_right r_1 ... r_n) where 
        r_left  = (LEFT repr[e_{-1}]) if e_{-1} exists, otherwise non-existent 
        r_right = (RIGHT repr[e_0]) if e_0 exists, otherwise non-existent 
        r_i = (d_i repr[e_i]) if e_i is non-null 
        r_i = (d_i) if e_i is null 


Several prominent features of this program should be kept in mind; it was written for 
perspicuity and convenience of proof. There are therefore no global variable references; 
for each subroutine the input stream is passed as an argument and returned as a value. 

The result is a program which is approximately twice as long and much less efficient than it 
could be. The main problem is that passing the input string as an argument often requires 
that the same expression be evaluated more than once. This problem could be easily solved 
but would result in rather more obscure code; efficiency has been sacrificed for clarity. An 
equivalent but efficient program could be proven correct by proving its equivalence to this 
one. Such a proof should be considerably shorter than an original proof of correctness as 
given here. 


The Basic Parsing Program 


(DEFUN PARSE (RBP STRING) 
  (ASSOC RBP (NUL-TYPE STRING))) 


(DEFUN ASSOC (RBP STATE) 
  (COND ((LESSP RBP (GET (CADR STATE) 'LBP)) 
         (ASSOC RBP (LEF-TYPE STATE))) 
        (T STATE))) 


This is the top level control structure of the parser. The function PARSE receives as 
input a right binding power and a list of symbols, the string in S_D to be parsed. The status 
of the parse is contained in the variable STATE which is passed and returned among the 
procedures. STATE is always a list whose first element is the representation of the 
expression (tree) parsed so far, and whose remaining elements are the unparsed input string. 
Given that an expression has been parsed, the function ASSOC (not the standard LISP 
function ASSOC) decides whether to give it as a left argument to the next operator in the 
string (by calling ASSOC recursively), or to return the current state. 

The function NUL-TYPE collects the arguments for the next operator in the string, on 
the assumption that it is nilfix or prefix. It in turn calls NILFIX or PREFIX to handle the 
separate cases. The function LEF-TYPE is similar, except that the expression parsed so far 
is assumed to be the left argument to the next operator in the string. The subroutine FIND 
handles the collection of all annotation tokens and arguments; it uses the functions 
LAMBDA-P (predicate for null string membership in a pattern) and FIRST (the set first 
previously defined). 
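The control structure just described can be tried out in a few lines of Python. The sketch below is a simplified transliteration, not the program of the paper: it handles only operators whose annotation pattern is Λ (so FIND is omitted), it represents STATE as a pair `(tree, rest)` rather than a single list, it stops at end of input or at a token with no left-type definition instead of consulting an LBP property, and the operator tables `NUL` and `LEF` with their binding powers are assumptions made for the example.

```python
# Hypothetical operator tables standing in for the LISP property lists.
NUL = {"-": ("PREFIX", 30)}                       # token -> (type, rbp)
LEF = {"+": ("INFIX", 10, 10),                    # token -> (type, lbp, rbp)
       "*": ("INFIX", 20, 20),
       "!": ("POSTFIX", 40, 40)}


def parse(rbp, tokens):
    """Parse one expression whose left context binds with power rbp."""
    return assoc(rbp, nul_type(tokens))


def assoc(rbp, state):
    """Hand the parsed expression to the next operator while rbp < its lbp."""
    tree, rest = state
    if rest and rest[0] in LEF and rbp < LEF[rest[0]][1]:
        return assoc(rbp, lef_type(state))
    return state


def nul_type(tokens):
    """Next token is taken as a prefix operator, or by default as a variable or constant."""
    if not tokens:
        raise SyntaxError("end of input")
    op, rest = tokens[0], tokens[1:]
    if op in NUL:
        arg, rest = parse(NUL[op][1], rest)
        return ([op, ["RIGHT", arg]], rest)
    return ([op], rest)


def lef_type(state):
    """Next token is an operator taking the expression parsed so far as left argument."""
    left, rest = state
    op, rest = rest[0], rest[1:]
    kind, _lbp, rbp = LEF[op]
    if kind == "POSTFIX":
        return ([op, ["LEFT", left]], rest)
    arg, rest = parse(rbp, rest)
    return ([op, ["LEFT", left], ["RIGHT", arg]], rest)


tree, rest = parse(0, "a + b * c".split())
```

With these assumed binding powers, `tree` groups `b * c` as the right argument of `+`, and `a + b + c` associates to the left because the test in `assoc` is a strict comparison.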


Functions to Process NUL-TYPE Operators 


(DEFUN NUL-TYPE (STRING) 
  (COND ((NULL STRING) ERROR)                          ;end of input 
        ((EQ (GET (CAR STRING) 'NUL-TYP) 'NILFIX) 
         (NILFIX (CAR STRING)                          ;operator 
                 (CDR STRING)                          ;unparsed string 
                 (GET (CAR STRING) 'NUL-RBP)           ;rbp[op] 
                 (GET (CAR STRING) 'NUL-PAT)))         ;p[op] 
        ((EQ (GET (CAR STRING) 'NUL-TYP) 'PREFIX) 
         (PREFIX (CAR STRING)                          ;as above 
                 (CDR STRING) 
                 (GET (CAR STRING) 'NUL-RBP) 
                 (GET (CAR STRING) 'NUL-PAT))) 
        (T                                             ;default case 
         (NILFIX (CAR STRING)                          ;variable or 
                 (CDR STRING)                          ;constant 
                 0 
                 '(LAMB))))) 


(DEFUN NILFIX (OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (CAR (FIND RBP (CONS NIL REST) PAT))) 
        (CDR (FIND RBP (CONS NIL REST) PAT)))) 


(DEFUN PREFIX (OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'RIGHT (CAR (PARSE RBP REST)))) 
                (CAR (FIND RBP 
                           (CONS NIL (CDR (PARSE RBP REST))) 
                           PAT))) 
        (CDR (FIND RBP (CONS NIL (CDR (PARSE RBP REST))) PAT)))) 


Functions to Process LEF-TYPE Operators 


(DEFUN LEF-TYPE (STATE) 
  (COND ((NULL (CDDR STATE)) ERROR)                    ;end of string 
        ((EQ (GET (CADR STATE) 'LEF-TYP) 'POSTFIX) 
         (POSTFIX (CAR STATE)                          ;left arg 
                  (CADR STATE)                         ;operator 
                  (CDDR STATE)                         ;unparsed string 
                  (GET (CADR STATE) 'LEF-RBP)          ;rbp[op] 
                  (GET (CADR STATE) 'LEF-PAT)))        ;p[op] 
        ((EQ (GET (CADR STATE) 'LEF-TYP) 'INFIX) 
         (INFIX (CAR STATE)                            ;as above 
                (CADR STATE) 
                (CDDR STATE) 
                (GET (CADR STATE) 'LEF-RBP) 
                (GET (CADR STATE) 'LEF-PAT))) 
        (T ERROR)))                                    ;no left def. 


(DEFUN POSTFIX (LVAL OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'LEFT LVAL)) 
                (CAR (FIND RBP (CONS NIL REST) PAT))) 
        (CDR (FIND RBP (CONS NIL REST) PAT)))) 


(DEFUN INFIX (LVAL OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'LEFT LVAL)) 
                (LIST (LIST 'RIGHT (CAR (PARSE RBP REST)))) 
                (CAR (FIND RBP 
                           (CONS NIL (CDR (PARSE RBP REST))) 
                           PAT))) 
        (CDR (FIND RBP (CONS NIL (CDR (PARSE RBP REST))) PAT)))) 


Annotation Argument Processor 


(DEFUN FIND (RBP STATE PAT) 
  (COND ((EQ (CAR PAT) 'LAMB)                          ;p=Λ 
         STATE) 
                                                       ;p=qr 
        ((EQ (CAR PAT) 'CONC) 
         (FIND RBP (FIND RBP STATE (CADR PAT)) (CADDR PAT))) 
                                                       ;p=(q|r) 
        ((EQ (CAR PAT) 'UNION) 
         (COND ((MEMBER (CADR STATE) (FIRST (CADR PAT))) 
                (FIND RBP STATE (CADR PAT))) 
               ((MEMBER (CADR STATE) (FIRST (CADDR PAT))) 
                (FIND RBP STATE (CADDR PAT))) 
               ((LAMBDA-P PAT) 
                STATE) 
               (T ERROR)))                             ;neither alternative matches 
                                                       ;p=(q)* 
        ((EQ (CAR PAT) 'STAR) 
         (COND ((MEMBER (CADR STATE) (FIRST (CADR PAT))) 
                (FIND RBP (FIND RBP STATE (CADR PAT)) PAT)) 
               (T STATE))) 
                                                       ;p="d" 
        ((AND (NULL (CDR PAT)) (EQ (CAR PAT) (CADR STATE))) 
         (CONS (APPEND (CAR STATE) 
                       (LIST (LIST (CADR STATE)))) 
               (CDDR STATE))) 
                                                       ;p="d" ~ 
        ((EQ (CAR PAT) (CADR STATE)) 
         (CONS (APPEND (CAR STATE) 
                       (LIST (LIST (CADR STATE) 
                                   (CAR (PARSE RBP (CDDR STATE)))))) 
               (CDR (PARSE RBP (CDDR STATE))))) 
        (T ERROR)))                                    ;missing token -- (car pattern) 


Pattern Processing Functions 


(DEFUN LAMBDA-P (P) 
  (COND ((EQ (CAR P) 'LAMB) T)                         ;p=Λ 
        ((EQ (CAR P) 'CONC)                            ;p=qr 
         (AND (LAMBDA-P (CADR P)) (LAMBDA-P (CADDR P)))) 
        ((EQ (CAR P) 'UNION)                           ;p=(q|r) 
         (OR (LAMBDA-P (CADR P)) (LAMBDA-P (CADDR P)))) 
        ((EQ (CAR P) 'STAR) T)                         ;p=(q)* 
        (T NIL)))                                      ;p="d" or "d"~ 


(DEFUN FIRST (P) 
  (COND ((EQ (CAR P) 'LAMB) NIL)                       ;p=Λ 
        ((EQ (CAR P) 'CONC)                            ;p=qr 
         (APPEND (FIRST (CADR P)) 
                 (COND ((LAMBDA-P (CADR P)) (FIRST (CADDR P))) 
                       (T NIL)))) 
        ((EQ (CAR P) 'UNION)                           ;p=(q|r) 
         (APPEND (FIRST (CADR P)) (FIRST (CADDR P)))) 
        ((EQ (CAR P) 'STAR) (FIRST (CADR P)))          ;p=(q)* 
        (T (LIST (CAR P)))))                           ;p="d" or "d"~ 
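To see the pattern functions at work without a LISP system, the following Python transliteration may help. It is a sketch under stated assumptions: patterns use the list representation repr[p] given earlier (for example `["CONC", q, r]` and `["THEN", "ARG"]`), the parse state is a pair `(annotations, rest)` instead of a single list, errors are raised as `SyntaxError`, and `find` receives the expression parser as an extra parameter `parse` (an addition made here so that the block is self-contained).

```python
def lambda_p(p):
    """True iff the null string matches pattern p."""
    head = p[0]
    if head == "LAMB":
        return True
    if head == "CONC":
        return lambda_p(p[1]) and lambda_p(p[2])
    if head == "UNION":
        return lambda_p(p[1]) or lambda_p(p[2])
    if head == "STAR":
        return True
    return False                                  # p = "d" or "d" ~


def first(p):
    """List of tokens that can begin a string matching p."""
    head = p[0]
    if head == "LAMB":
        return []
    if head == "CONC":
        return first(p[1]) + (first(p[2]) if lambda_p(p[1]) else [])
    if head == "UNION":
        return first(p[1]) + first(p[2])
    if head == "STAR":
        return first(p[1])
    return [head]                                 # p = "d" or "d" ~


def find(rbp, state, pat, parse):
    """Collect annotation tokens and arguments matching pat from the input."""
    found, rest = state
    head = pat[0]
    nxt = rest[0] if rest else None               # next input token, if any
    if head == "LAMB":
        return state
    if head == "CONC":
        return find(rbp, find(rbp, state, pat[1], parse), pat[2], parse)
    if head == "UNION":
        if nxt in first(pat[1]):
            return find(rbp, state, pat[1], parse)
        if nxt in first(pat[2]):
            return find(rbp, state, pat[2], parse)
        if lambda_p(pat):
            return state
        raise SyntaxError("neither alternative matches")
    if head == "STAR":
        if nxt in first(pat[1]):
            return find(rbp, find(rbp, state, pat[1], parse), pat, parse)
        return state
    if head == nxt and len(pat) == 1:             # p = "d": delimiter only
        return (found + [[nxt]], rest[1:])
    if head == nxt:                               # p = "d" ~: delimiter plus argument
        arg, rest2 = parse(rbp, rest[1:])
        return (found + [[nxt, arg]], rest2)
    raise SyntaxError("missing token %s" % head)


# "THEN" ~ ("ELSE" ~ | Λ), and a stub parser that takes one token as an argument.
IF_PAT = ["CONC", ["THEN", "ARG"], ["UNION", ["ELSE", "ARG"], ["LAMB"]]]
one_token = lambda rbp, toks: ([toks[0]], toks[1:])
```

Calling `find(0, ([], ["THEN", "b", "ELSE", "c", "x"]), IF_PAT, one_token)` collects both annotation arguments and leaves `["x"]` unparsed; with the input `["THEN", "b"]` the optional ELSE part is skipped because the second factor is matched by the null string.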


V. CORRECTNESS 


Using the definitions presented in Chapter IV, we are now prepared to formally state 
and prove the notion of correctness discussed informally in Section III.D. In the first section 
of this chapter we state our main result, the PARSE theorem, and discuss three important 
corollaries which embody more closely our intuitive notions of correctness. Section V.B 
presents a number of preliminary lemmas, dealing primarily with properties of annotation 
patterns in our meta-language. These results are theoretical properties and are completely 
independent of the parsing algorithm. Sections V.C and V.D contain the proofs of parts I 
and II, respectively, of the PARSE theorem; these proofs are long but straightforward 
since the interesting theoretical results are separately proven. 


V.A Formal Statement 


We begin a formal statement of correctness by recalling the user-oriented description 
of a defined language. For any set of meta-language productions D, the language S_D ⊆ Σ* 
defined by D is S_D = W_D(E'_D), where W_D is the writing function and E'_D is the set of 
grammatical expression trees. The parser for the language, constructed by the algorithm of 
Section IV.C, is represented by the function P_D. This function maps strings of Σ* into 
expression trees (defined in IV.C). The function P_D is partial; when we write P_D(s) = e, 
we mean that the parser, when given the input string s, halts error-free and returns e. We 
state now our main result. 


PARSE THEOREM: 
    I.  ∀D ∀e∈E'_D (P_D(W_D(e)) = e) 
    II. ∀D ∀s∈Σ* (P_D(s) halts error-free ⇒ s ∈ S_D) 


For the rest of this chapter we assume that D refers to some language definition expressed 
in the meta-language of Section IV.A; i.e., we drop the "for all D". 


We examine now the sufficiency of the result relative to general notions of 
correctness in the form of corollaries. The first is that a translator should be an acceptor for 
the language S_D in the ordinary sense: the translator should halt error-free exactly when 
given a sentence in the language S_D. 


Corollary 1 (Acceptor): 
    ∀s∈Σ* (P_D(s) halts error-free ⇔ s ∈ S_D) 
Proof: One direction simply restates part II of the PARSE theorem. Now assume s ∈ S_D. 
By definition there is some e ∈ E'_D such that s = W_D(e). Part I says 
P_D(W_D(e)) = P_D(s) = e; i.e., PARSE halts error-free. ∎ 


We also expect that the translator, when it halts error-free, returns a valid parse of 
the input string. 


Corollary 2 (Parser): 
    ∀s∈S_D (P_D(s) ∈ E'_D ∧ W_D(P_D(s)) = s). 
Proof: Assume s ∈ S_D. Then there is some e ∈ E'_D such that s = W_D(e). By Part I we 
know P_D(s) = P_D(W_D(e)) = e ∈ E'_D. Furthermore, since P_D(s) = e, we have 
W_D(P_D(s)) = W_D(e) = s. ∎ 

We note that Corollary 2 only guarantees the output of some valid expression tree, or 
parse, for each input. We have not proven that such a parse must be unique; i.e., that the 
language is unambiguous. Ambiguity is a property of a language and its means of 
definition, not of a particular parsing scheme. 


Corollary 3 (Uniqueness): 
    ∀e,e'∈E'_D (W_D(e) = W_D(e') ⇒ e = e') 

Proof: Assume W_D(e) = W_D(e') for e,e' ∈ E'_D. Since the parser is a function, 
P_D(W_D(e)) = P_D(W_D(e')). Then by Part I we have e = P_D(W_D(e)) 
= P_D(W_D(e')) = e'. ∎ 


Although not strictly a property of the parser, we treat this property here for completeness 
and convenience of proof. 

V.B Preliminary Lemmas 


This section formally states and proves a number of necessary properties of our 
definitional system. Some are merely restatements of definitions and are included for 
uniform reference; the majority are derived properties which are essential to the program 
proof. The final two lemmas are correctness proofs of two simple utility programs, 
LAMBDA-P and FIRST. 

We begin with binding powers. 


Lemma 1 (Binding powers): 

(a) If the token op is defined as an operator in D, then rbp[op] ≥ 0 and lbp[op] ≥ 0 
if defined. 

(b) If the token d is used as a delimiter in D, then lbp[d] = 0. 

(c) For any e ∈ E_D, l-index[e] ≥ 0 and r-index[e] ≥ 0. 


Proof: Parts (a) and (b) are immediate from the definitions. Part (c) uses part (a) and 
follows by trivial induction over the definitions of l-index and r-index. ∎ 


The following lemmas describe properties of annotation patterns. Although patterns 
ultimately determine sequences of labelled subtrees, these properties will be stated and 
proven in terms of a simpler but equivalent language. We say that a pattern may be 
matched by strings of symbols, where the symbols include the special symbol ~ and tokens 
of the defined language. The same convention was used in the discussion of first and cont 
in Section IV.A. The correspondence between the strings used here and the ordered sets of 
labelled subtrees is straightforward. The symbol ~ can only follow a token in strings which 
match patterns. A token d followed by ~ in one of these strings corresponds to a non-null 
subtree labelled d. A token d not followed by ~ corresponds to a null subtree labelled d. 
Lemmas 4 and 11 guarantee that the symbol ~ is invisible; i.e., it plays no role in any of the 
results presented here. The results apply equally to sequences of labelled subtrees. 

For convenience we restate here an essential feature of the definition of patterns, the 
restrictions on the inductive use of pattern concatenation, alternation, and star closure. 


Restrictions (Definition of patterns): Let p, q, r be patterns. 
    R1. If p = qr then cont_q ∩ first_r = ∅. 
    R2. If p = (q|r) then first_q ∩ first_r = ∅. 
    R3. If p = (q)* then cont_q ∩ first_q = ∅. 


Because our parsing algorithm continually requires us to treat Λ as a special case, we 
would like to know some of the null-string properties of patterns. 


Lemma 2 (Λ predicate): Let p, q, r be patterns. 
    (a) If p = qr then Λ≺p iff Λ≺q ∧ Λ≺r. 
    (b) If p = (q|r) then Λ≺p iff Λ≺q ∨ Λ≺r. 
    (c) If p = (q)* then Λ≺p. 
Proof: Immediate from the definition of match. ∎ 

Lemma 2 is the basis for the algorithm used by LAMBDA-P, which calculates whether or not 
Λ matches a particular pattern. 

The next lemma is relevant to the computation of the set first for a pattern. 

Lemma 3 (first): Let p, q, r be patterns. 
    (a) If p = qr then 
        1. if Λ⊀q then first_p = first_q, 
        2. if Λ≺q then first_p = first_q ∪ first_r. 
    (b) If p = (q|r) then first_p = first_q ∪ first_r. 
    (c) If p = (q)* then first_p = first_q. 
Proof: Immediate from the definitions of first and match. ∎ 

The parser looks at the first symbol of a string in order to decide how to begin 
matching the string to a pattern. The next lemma guarantees that the parser never looks at 
~ when deciding; i.e., that the first symbol is always some delimiter and not part of a 
subexpression. 


Lemma 4: If p is a pattern, first_p contains only tokens (not ~). 

Proof: By induction on the definition of a pattern. If p = Λ, "d", or "d" ~ then 
first_p = ∅, {d}, or {d} respectively. If p = qr, (q|r), or (q)*, then by 
Lemma 3 and induction first_p contains only tokens. ∎ 


We turn now to properties of the set cont. We begin with its value relative to the 
null string. 


Lemma 5: If p is a pattern and Λ≺p then cont_p(Λ) = first_p. 
Proof: From definitions, 
    cont_p(Λ) = ∪_{Λβ≺p, β≠Λ} {first(β)} = ∪_{β≺p, β≠Λ} {first(β)} = first_p. ∎ 


This result has a strong implication for star closure; restriction R3 prevents the use of star 
closure on nontrivial patterns matched by the null string. 


Lemma 6: If p is a pattern and cont_p ∩ first_p = ∅ and Λ≺p then only Λ matches p. 

Proof: Assume Λ≺p. By Lemma 5 we have first_p = cont_p(Λ) ⊆ cont_p. Since we assumed 
cont_p ∩ first_p = ∅, it must be that first_p = ∅, implying that no string other than Λ can 
match p. ∎ 


The next lemma is a preliminary result to be used in the proofs of Lemmas 8 and 10. 
It deals with the way in which a string can match a concatenated pattern. 


Definition: The string w is a prefix of string w' iff w' = wα for some string α; if α ≠ Λ 
then w is said to be a nontrivial prefix of w'. 


Lemma 7 (Ambiguity): Let q and r be patterns. If 
    (1) cont_q ∩ first_r = ∅, 
    (2) w = w_1 w_2 where w_1≺q and Λ≠w_2≺r, 
    (3) w' = w_1' w_2' where w_1'≺q and w_2'≺r, and 
    (4) w is a prefix of w', 
then w_1 = w_1'. 


Proof: Since w is a prefix of w', exactly one of the following cases must hold: (i) w_1 is a 
nontrivial prefix of w_1', (ii) w_1' is a nontrivial prefix of w_1, or (iii) w_1 = w_1'. We will 
show that (i) and (ii) do not hold. 

(i) w_1' = w_1 α for some α ≠ Λ. By definition, first(α) ∈ cont_q(w_1) ⊆ cont_q. Since 
w_2 ≠ Λ, we also have first(α) = first(w_2) ∈ first_r, violating condition (1). 

(ii) w_1 = w_1' α for some α ≠ Λ. It cannot be the case that w_2' = Λ because w = w_1 w_2 is 
a prefix of w'. By symmetry this reduces to case (i). ∎ 


It is a corollary of Lemma 7 that when a string w matches p = qr, it matches in only one 
way. Applying Lemma 7 inductively, we get the same implication for star closure. 
Lemmas 8, 9, and 10 describe the contents of the set cont relative to concatenation, 
alternation, and star closure. Since these are the essential lemmas for the actual program 
proof, they are stated in terms of specific strings; i.e., they describe cont_p(w) rather than 
cont_p. The lemmas are intended to directly imply the correctness of the pattern matching 
part of the parsing algorithm. For example, Lemma 8 guarantees that concatenated patterns 
may be dealt with locally, one at a time. 

Lemma 8 (cont): Let p = qr and w = w_1 w_2 with w_1≺q and w_2≺r, then 
    (a) if w_2≠Λ then cont_p(w) = cont_r(w_2) 
    (b) if w_2=Λ then cont_p(w) = cont_r(w_2) ∪ cont_q(w_1). 


Proof: (a) 

⊇ We have cont_r(w_2) = ∪_{w_2β≺r, β≠Λ} {first(β)} ⊆ ∪_{w_1w_2β≺p, β≠Λ} {first(β)} 
= cont_p(w) by definition and since w_2β≺r implies that w_1w_2β≺p. 

⊆ By definition cont_p(w) = ∪_{wβ≺p, β≠Λ} {first(β)}. If wβ = w_1w_2β≺p, then let 
wβ = w' = w_1'w_2' where w_1'≺q and w_2'≺r. Since w is a prefix of w' and w_2≠Λ, we 
have w_1 = w_1' by Lemma 7. Then w_2' = w_2β, so first(β) ∈ cont_r(w_2). 

(b) 

⊇ As in part (a) cont_r(w_2) ⊆ cont_p(w). In addition, 
cont_q(w_1) = ∪_{w_1β≺q, β≠Λ} {first(β)} ⊆ ∪_{wβ≺p, β≠Λ} {first(β)} = cont_p(w) since 
w_1β≺q implies that wβ = w_1Λβ≺p. 

⊆ By definition cont_p(w) = ∪_{wβ≺p, β≠Λ} {first(β)}. If wβ = w_1w_2β = w_1β≺p then let 
wβ = w' = w_1'w_2' where w_1'≺q and w_2'≺r. We consider the three cases of the relationship 
between w_1 and w_1'. 
(i) If w_1 is a nontrivial prefix of w_1', then first(β) ∈ cont_q(w_1). 
(ii) If w_1' is a nontrivial prefix of w_1, then w_2'≠Λ. But then first(w_2') ∈ cont_q(w_1'). 
Since first(w_2') ∈ first_r this violates R1. 
(iii) If w_1 = w_1' then β = w_2' and first(β) ∈ first_r. By Lemma 5 
first_r = cont_r(Λ) = cont_r(w_2). ∎ 
Lemma 9 (cont): Let p, q, r be patterns and p = (q|r). 
    (a) If w=Λ then cont_p(w) = first_q ∪ first_r. 
    (b) If Λ≠w≺q then cont_p(w) = cont_q(w). 
    (c) If Λ≠w≺r then cont_p(w) = cont_r(w). 


Proof: (a) By Lemma 5 cont_p(Λ) = first_p, and by Lemma 3b first_p = first_q ∪ first_r. 

(b) Claim wβ≺p iff wβ≺q. Clearly wβ≺q implies wβ≺p. Conversely, if wβ≺p then either 
wβ≺q or wβ≺r. If wβ≺r then first(wβ) ∈ first_r. But since w≠Λ, 
first(wβ) = first(w) ∈ first_q, violating R2, so wβ≺q. We conclude that 
cont_p(w) = ∪_{wβ≺p, β≠Λ} {first(β)} = ∪_{wβ≺q, β≠Λ} {first(β)} = cont_q(w). 

(c) Similar to part (b). ∎ 


Lemma. 18 (cont): Let p,¢,r be patterns and p = (¢)*. 
(a) If w=d then cont , tw) = firste, 


(b) If Aww~<p where w=w,...0, for n2l and aj<¢ for Lsisn, then 
cont , ( wo) = cont, {o,). U first, 


Proof:

(a) If ω=Λ then cont_p(Λ) = first_p = first_q by Lemmas 5 and 3c.

(b) By Lemma 6 we need only consider two cases: either q is matched only by Λ, or q is not
matched by Λ at all. Since ω≠Λ we assume the second case, where Λ does not match q.

⊇ We have first_q = ∪ {first(β)}, taken over all β≠Λ such that β matches q, which is
contained in ∪ {first(β)}, taken over all β≠Λ such that ωβ matches p, that is, in cont_p(ω),
since β matches q and ω matches p implies that ωβ matches p. We have also that
cont_q(ω_n) = ∪ {first(β)}, taken over all β≠Λ such that ω_nβ matches q, is contained in
cont_p(ω), since ω_nβ matches q and ω_1...ω_{n-1} matches p implies that ωβ matches p.

⊆ By induction on n.
n=1: By definition cont_p(ω) = ∪ {first(β)}, taken over all β≠Λ such that ωβ matches p. Let
ω_1β = ω' = ω_1'...ω_m' where ω_i' matches q for 1≤i≤m. Since ω'≠Λ, m≥1. We consider the three
possible relationships between ω_1 and ω_1'.
(i) If ω_1' is a nontrivial prefix of ω_1 then m≥2, and since ω_2'≠Λ (recalling that Λ does
not match q), we have first(ω_2') ∈ cont_q(ω_1') ⊆ cont_q. But also first(ω_2') ∈ first_q,
violating R3. Contradiction.
(ii) If ω_1=ω_1' then, since β≠Λ, we have m≥2. Again ω_2'≠Λ, so
first(β) = first(ω_2') ∈ first_q.
(iii) If ω_1 is a nontrivial prefix of ω_1' then first(β) ∈ cont_q(ω_1).
n>1: Assume the result for n-1. If ω=ω_1...ω_n then by definition
cont_p(ω) = ∪ {first(β)}, taken over all β≠Λ such that ωβ matches p. If ωβ matches p then let
ωβ=ω'=ω_1'...ω_m' where ω_i' matches q for 1≤i≤m. Apply Lemma 7 as follows. We have that ω is
a prefix of ω'. Decompose ω as ω = ω_1 ω_2...ω_n where ω_1 matches q and ω_2...ω_n matches p.
Likewise ω' = ω_1' ω_2'...ω_m' where ω_1' matches q and ω_2'...ω_m' matches p. Since
first_p = first_q by Lemma 3c, we have cont_q ∩ first_p = ∅ (using R3). So by Lemma 7,
ω_1=ω_1'. We now have that ω_2...ω_nβ = ω_2'...ω_m' matches p, so first(β) ∈ cont_p(ω_2...ω_n).
By induction we have cont_p(ω_2...ω_n) = cont_q(ω_n) ∪ first_q.


Our final lemma about the set cont is the counterpart of Lemma 4 for the set first. 


Lemma 11: If p is a pattern, cont_p contains only tokens (not ~).


Proof: By induction on the definition of patterns. If p = Λ, "d", or "d" ~ then
cont_p = ∅. If p = qr then by Lemma 8 and induction. If p = (q|r) then by Lemma 9,
induction, and Lemma 4. If p = (q)* then by Lemma 10, induction, and Lemma 4. ∎


The final two lemmas establish the correctness of the pattern utility programs LAMBDA-P and
FIRST. Their correctness follows almost directly from Lemmas 2 and 3.


Lemma 12 (LAMBDA-P): Let p be a pattern. Then (LAMBDA-P p) = T iff Λ matches p.


Proof: By induction on patterns. The program deals with five exclusive cases. When p=Λ
the answer is T. When p="d" or "d" ~, the answer is NIL. When p=qr the
answer is (AND (LAMBDA-P q) (LAMBDA-P r)), by induction and Lemma 2. Similarly,
when p=(q|r) the answer is (OR (LAMBDA-P q) (LAMBDA-P r)), and when p=(q)*
the answer is T. ∎


Lemma 13 (FIRST): Let p be a pattern. Then (FIRST p) = a list containing the
symbols of first_p.


Proof: By induction on patterns, with the same five exclusive cases as the previous lemma; we
now use Lemma 3 inductively. When p=Λ then NIL. When p="d" or "d" ~, then (d).
When p=qr then (APPEND (FIRST q) (COND ((LAMBDA-P q) (FIRST r)))), where
whether Λ matches q is determined by Lemma 12. When p=(q|r) then
(APPEND (FIRST q) (FIRST r)). Finally, when p=(q)* then simply (FIRST q). ∎
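
The case analysis in Lemmas 12 and 13 can be sketched in Python. This is only an illustration of the five cases, not the LISP programs LAMBDA-P and FIRST themselves: the tuple encoding of patterns and the names lambda_p and first are assumptions introduced here.

```python
# Hypothetical pattern encoding (not from the source):
#   ("lambda",)      the null pattern
#   ("tok", d)       a bare delimiter "d"
#   ("arg", d)       a delimiter followed by an expression, "d" ~
#   ("seq", q, r)    concatenation qr
#   ("alt", q, r)    alternation (q|r)
#   ("star", q)      repetition (q)*

def lambda_p(p):
    """Return True iff the null string matches pattern p (cf. Lemma 12)."""
    kind = p[0]
    if kind == "lambda":
        return True
    if kind in ("tok", "arg"):
        return False
    if kind == "seq":
        return lambda_p(p[1]) and lambda_p(p[2])
    if kind == "alt":
        return lambda_p(p[1]) or lambda_p(p[2])
    if kind == "star":
        return True
    raise ValueError(f"unknown pattern kind: {kind!r}")


def first(p):
    """Return a list of the delimiters that can begin a match of p (cf. Lemma 13)."""
    kind = p[0]
    if kind == "lambda":
        return []
    if kind in ("tok", "arg"):
        return [p[1]]
    if kind == "seq":
        # The first set of r contributes only when q can match the null string.
        return first(p[1]) + (first(p[2]) if lambda_p(p[1]) else [])
    if kind == "alt":
        return first(p[1]) + first(p[2])
    if kind == "star":
        return first(p[1])
    raise ValueError(f"unknown pattern kind: {kind!r}")


# Two illustrative patterns: THEN-part with optional ELSE-part, and a
# comma-separated argument list closed by a right parenthesis.
if_pattern = ("seq", ("arg", "THEN"), ("alt", ("arg", "ELSE"), ("lambda",)))
arg_list = ("seq", ("star", ("arg", ",")), ("tok", ")"))

print(lambda_p(if_pattern), first(if_pattern))
print(lambda_p(arg_list), first(arg_list))
```

For arg_list the repetition can match the null string, so the closing delimiter joins the comma in the first set, which is the p=qr case of Lemma 13.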
V.C PARSE Theorem I

We present now the proof of the first PARSE theorem stated in Section V.A:

∀e∈E'p (Pp(Wp(e)) = e)

where E'p is the set of expression trees defined formally in Section IV.B, Wp is the writing
function of Section IV.B, and the parsing function Pp corresponds to the LISP program
PARSE presented in Section IV.C. The program PARSE accepts as input a list of tokens; its
value, if it halts error-free, is the LISP representation of an expression tree, as defined in
Section IV.C. The final token in any input string to PARSE is the special termination
symbol ⊣; the left binding power of this symbol is assumed to be -1, the only negative
left binding power used. In terms of the program the theorem is:

PARSE Theorem I: If e ∈ E'p and δ = Wp(e) then
(PARSE -1 (δ ⊣)) = (repr[e] ⊣)

Its inductive proof requires a restatement in the following, more general form:

Theorem 1.9: If for some e and rbp we are given
C1. e∈E'p and δ = t_1...t_k = Wp(e) for k≥1.
C2. r-index[e] ≥ lbp[t_{k+1}].
C3. t_{k+1} ∉ c-set[e].
C4. rbp < l-index[e].
C5. rbp ≥ lbp[t_{k+1}].

then (PARSE rbp (δ t_{k+1}...)) = (repr[e] t_{k+1}...).
PARSE Theorem I is a special case of Theorem 1.9 by the following argument. C1 is the
given, letting δ = t_1...t_k. The symbol t_{k+1} is ⊣, which has a left binding power of -1;
from Lemma 1 we know that r-index[e] ≥ 0, so C2 is satisfied. For condition C3 we
observe that since ⊣ is not in the defined language, it cannot be in c-set[e]. As above, we
know that l-index[e] ≥ 0, so C4 is true. Finally, we know that rbp = lbp[⊣] = -1,
satisfying C5.


Outline of Proof 


Theorem 1.9 is the last in a sequence of nine subsidiary theorems, which correspond
roughly to the subroutines of the program PARSE. Theorem 1.1 (FIND) covers the correct
parsing of the annotation part of an expression. Theorems 1.2, 1.3, and 1.4 (NILFIX,
PREFIX, and NUL-TYPE) deal with NUL-TYPE operators, and Theorems 1.5, 1.6, and 1.7
(POSTFIX, INFIX, and LEF-TYPE) similarly treat LEF-TYPE operators. Theorems 1.8 and 1.9
(PARSEa and PARSEb) state the top level behavior of the PARSE and ASSOC programs, the
essential part of the parsing algorithm; Theorem 1.8 corresponds to the recursive parsing of
left arguments and Theorem 1.9 to right arguments. Each theorem guarantees that if its
arguments meet certain conditions, then the result of the corresponding subroutine has the
desired property; i.e., that the subroutine operates correctly. With the exception of the
language definition attached to property lists, as described in Section IV.C, each subroutine
uses only values given as explicit arguments. No side effects need be mentioned, since the
given implementation of PARSE contains only local variables.

The theorems are proven using simultaneous induction over the set E'p of expression
trees. At each level of induction, they may be proven sequentially according to their
dependence by subroutine calls, as diagrammed in the partial ordering of Figure 2. In this
figure the proof of the upper theorem of a linked pair depends on the lower theorem; the
inductive use of Theorems 1.8 and 1.9 is indicated at the bottom of the graph. For instance,
Theorem 1.4 depends on Theorems 1.2 and 1.3, which in turn depend on 1.1. In addition, 1.3
and 1.1 depend inductively on 1.9.

We use simple induction in this theorem to correspond exactly to the definition of the
domain E'p; i.e., using a basis and an induction step. This form of definition was chosen for
clarity and precision. The nature of the domain would, however, allow a proof by strong
induction (without a basis step), since the theorem only requires induction in the cases when
there exist non-null subtrees. Rather than redefine the domain or create unnecessary
confusion, strong induction is not used.

[Figure 2: dependency graph linking PARSE at the top, through NUL-TYPE and LEF-TYPE,
then PREFIX, NILFIX, POSTFIX, and INFIX, down to FIND at the bottom.]

Figure 2. Interdependence of Theorems 1.1-1.9

We now want to examine the five conditions we will impose on our input string in
order to guarantee that PARSE returns the correct value. When viewed relative to a call to
PARSE, they have the following interpretations. Condition 1 requires that the input string
begin with a sentence δ of the language. Condition 2 ensures that no subexpression on the
right end of δ becomes associated as a left argument to t_{k+1}, if t_{k+1} is an operator. If
t_{k+1} is a delimiter then condition 3 prevents its inclusion in any annotation within δ. The
right binding power of the call to PARSE must be low enough for the entire expression to be
returned (condition 4), but not so low that the expression is given as a left argument to
t_{k+1} (condition 5).


Statement of Theorems 1.1 through 1.9 


We precede our list of nine theorems by a formal statement of the conditions C1
through C5, on which they depend. For convenience in the proofs, the first three have
been broken down into their definitional components: conditions C1a through C1f are
equivalent to C1, C2a and C2b are equivalent to C2, and C3a and C3b are equivalent to C3.


Conditions: 


C1. e∈E'p and δ = α op β ω = Wp(e) = t_1...t_k

C1a. α = Wp(e_left) if e_left exists (Λ otherwise), β = Wp(e_right) if e_right exists
(Λ otherwise), and ω = d_1γ_1...d_nγ_n for n≥0, where γ_i = Wp(e_i) for 1≤i≤n when e_i is
non-null (Λ otherwise), and e_left, e_right, e_1,...e_n ∈ E'p when they exist and are non-null.

C1b. r-index[e_left] ≥ lbp[op] if e_left exists.

C1c. rbp[op] < l-index[e_right] if e_right exists.

C1d. rbp[op] < l-index[e_i] for 1≤i≤n, when e_i is non-null.

C1e. d_1 ∉ c-set[e_right] if e_right and d_1 exist.

C1f. d_i ∉ c-set[e_{i-1}] for 1<i≤n when e_{i-1} is non-null.


C2. r-index[e] ≥ lbp[t_{k+1}].
C2a. rbp[op] ≥ lbp[t_{k+1}] if e_n exists and is non-null.
C2b. r-index[e_n] ≥ lbp[t_{k+1}] if e_n exists and is non-null.


C3. t_{k+1} ∉ c-set[e].
C3a. t_{k+1} ∉ cont_{p[op]}(e_1,...e_n).
C3b. t_{k+1} ∉ c-set[e_n] if e_n exists.


C4. rbp < l-index[e].
C5. rbp ≥ lbp[t_{k+1}].


Notation: 
(i) When writing LISP expressions, upper case words and parentheses will always
refer to LISP code; when describing known values within LISP expressions,
lower case and square brackets will be used. Specifically, the meta-variable op
represents the token defined for op.

(ii) The representation of the annotation part produced by FIND is
((d_1 repr[e_1]) ... (d_n repr[e_n])) and will be written repr[e_1,...e_n].

(iii) Since the representations of patterns are not manipulated in this program, we
will abbreviate repr[p[op]] to simply p[op].

(iv) In proofs, we will use the names C1, C2, etc. to refer to the given conditions for
the theorem being proved; C1', C2', etc. will refer to the antecedents to be
satisfied when using Theorems 1.8 and 1.9 inductively.

We now state the nine theorems in full.

Theorem 1.1 (FIND): Given C1-C3 for some e. Then
(FIND rbp[op] (nil ω t_{k+1}...) p[op]) = (repr[e_1,...e_n] t_{k+1}...)

Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,
(NILFIX op (ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)

Theorem 1.3 (PREFIX): Given C1-C3 for some e. If op is defined PREFIX,
(PREFIX op (β ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)

Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then
(NUL-TYPE (op β ω t_{k+1}...)) = (repr[e] t_{k+1}...)

Theorem 1.5 (POSTFIX): Given C1-C3 for some e. If op is defined POSTFIX,
(POSTFIX repr[e_left] op (ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)

Theorem 1.6 (INFIX): Given C1-C3 for some e. If op is defined INFIX,
(INFIX repr[e_left] op (β ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)

Theorem 1.7 (LEF-TYPE): Given C1-C3 for some e. If op is defined POSTFIX or
INFIX, then
(LEF-TYPE (repr[e_left] op β ω t_{k+1}...)) = (repr[e] t_{k+1}...)

Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then
(PARSE rbp (δ t_{k+1}...)) = (ASSOC rbp (repr[e] t_{k+1}...))

Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then
(PARSE rbp (δ t_{k+1}...)) = (repr[e] t_{k+1}...)


Proof of Theorems 1.1 through 1.9, Basis Step 


For the basis step we assume that the tree e∈E'p is a single node whose label we
denote op. Then op is defined NILFIX, Λ matches p[op], and t_1 = δ = Wp(e) = op (n=0,
k=1), so the annotation part is ω=Λ. Note that since op is defined NILFIX, Theorems 1.3,
1.5, 1.6, and 1.7 are not applicable.


Theorem 1.1 (FIND): If ω=Λ matches p[op] and if t_2 ∉ cont_{p[op]}(Λ) then
(FIND rbp[op] (nil t_2...) p[op]) = (nil t_2...)


Proof: The proof is by induction over the definition of the pattern p[op]; the six possible
cases are handled by the six conditional clauses in the program.

Case 1. If p[op] = Λ then (FIND rbp[op] (nil t_2...) p[op]) = (nil t_2...)
immediately.

Cases 2, 3. Impossible, since if p[op] = "d" or "d" ~ it could not be that Λ matches p[op].

Case 4. If p[op] = qr, then (FIND rbp[op] (nil t_2...) p[op])

= (FIND rbp[op] (FIND rbp[op] (nil t_2...) q) r) by the program. We now use
induction on the expression (FIND rbp[op] (nil t_2...) q). Since Λ matches p[op] and
t_2 ∉ cont_{p[op]}(Λ), we know by Lemma 2a that Λ matches q and by Lemma 8b that
t_2 ∉ cont_q(Λ), so this expression is (nil t_2...) and we have

= (FIND rbp[op] (nil t_2...) r). As above we have that Λ matches r and t_2 ∉ cont_r(Λ), so
by another induction we have

= (nil t_2...).

Case 5. If p[op] = (q|r), then (FIND rbp[op] (nil t_2...) p[op]) is a conditional
with three clauses. The first test is (MEMBER t_2 first_q), using Lemma 13 for the
correctness of FIRST. Since t_2 ∉ cont_{p[op]}(Λ), we know that t_2 ∉ first_q by Lemma
9a, and this test will fail. Similarly the second test (MEMBER t_2 first_r) will fail. The
third test (LAMBDA-P p[op]) will be true by Lemma 12 and our assumption, so the
result is (nil t_2...).

Case 6. If p[op] = (q)*, then (FIND rbp[op] (nil t_2...) p[op]) is a conditional
with two clauses. The first test is (MEMBER t_2 first_q). Since t_2 ∉ cont_{p[op]}(Λ) we have
by Lemma 10 that t_2 ∉ first_q, and the test fails. The second clause then always returns
(nil t_2...). ∎


Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,
(NILFIX op (t_2...) rbp[op] p[op]) = (repr[e] t_2...).

Proof: From the program we have the expression
= (CONS (APPEND (op)
                (CAR (FIND rbp[op] (nil t_2...) p[op])))
        (CDR (FIND rbp[op] (nil t_2...) p[op])))
By Theorem 1.1 we know the call to FIND returns (nil t_2...), so we have
= (CONS (APPEND (op) nil) (t_2...)), which is
= (repr[e] t_2...) by the definition of representation. ∎


Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then
(NUL-TYPE (op t_2...)) = (repr[e] t_2...)

Proof: By the program and Axiom 1 covering definitions, we have
= (NILFIX op (t_2...) rbp[op] p[op]), which by Theorem 1.2 is
= (repr[e] t_2...). ∎

Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then
(PARSE rbp (δ t_2...)) = (ASSOC rbp (repr[e] t_2...))

Proof: By the program we have
= (ASSOC rbp (NUL-TYPE (δ t_2...))), which by Theorem 1.4 is
= (ASSOC rbp (repr[e] t_2...)). ∎

Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then
(PARSE rbp (δ t_2...)) = (repr[e] t_2...)

Proof: By Theorem 1.8 we have
= (ASSOC rbp (repr[e] t_2...)), which makes the test (LESSP rbp lbp[t_2]). By C5
this is false, so the value is
= (repr[e] t_2...). ∎


Proof of Theorems 1.1 through 1.9, Induction Step 


We assume that the tree e∈E'p is a node, whose label we denote op, with subtrees.
We assume Theorems 1.8 and 1.9 inductively for any of these subtrees.
Theorem 1.1 (FIND): If ω = d_1γ_1...d_nγ_n matches p[op] for n≥0 where γ_i = Wp(e_i) and
e_i ∈ E'p for 1≤i≤n, and if C1d, C1f, C2b, C3a, and C3b hold for ω, then
(FIND rbp[op] (nil ω t_{k+1}...) p[op]) = (repr[e_1,...e_n] t_{k+1}...)

Proof: Since FIND is called recursively, in general there will be some annotation fragment
v, not necessarily nil, which has already been parsed at some previous stage of the
execution. Thus, we will actually prove a more general assertion than that in the
problem statement itself:

(FIND rbp[op] (v ω t_{k+1}...) p[op]) = (v•repr[e_1,...e_n] t_{k+1}...),

where, for convenience, the result of appending two lists a and b is written a•b. As in
the basis step, the proof is by induction over the definition of the pattern p[op]; the six
possible cases are handled by the six separate clauses of the conditional statement.

Case 1. If p[op] = Λ then (FIND rbp[op] (v ω t_{k+1}...) p[op])

= (v ω t_{k+1}...). Since ω=Λ, then n=0 and k=0, so

= (v t_1...). But repr[e_1,...e_n] = nil so we have

= (v•repr[e_1,...e_n] t_{k+1}...).

Case 2. If p[op] = "d" then we must have ω = d = t_1. By the program then, since d
matches t_1, (FIND rbp[op] (v ω t_{k+1}...) p[op])

= (CONS (APPEND v ((d))) (t_2...))

= (v•((d)) t_2...) which is

= (v•repr[e_1,...e_n] t_{k+1}...).

Case 3. If p[op] = "d" ~ then we must have ω = d_1γ_1 where γ_1 = Wp(e_1). By the
program we have (FIND rbp[op] (v ω t_{k+1}...) p[op])

= (CONS (APPEND v (LIST (LIST d (CAR (PARSE rbp[op] (γ_1 t_{k+1}...))))))
        (CDR (PARSE rbp[op] (γ_1 t_{k+1}...))))
We apply Theorem 1.9 inductively to (PARSE rbp[op] (γ_1 t_{k+1}...)). C1' is satisfied
by our assumption about ω. Since n=1, C2' and C3' are satisfied by C2b and C3b
respectively. Finally, C4' and C5' are satisfied directly by C1d and C2a. We have then
(PARSE rbp[op] (γ_1 t_{k+1}...)) = (repr[e_1] t_{k+1}...), so our result is

= (CONS (APPEND v ((d repr[e_1]))) (t_{k+1}...))

= (v•repr[e_1,...e_n] t_{k+1}...).

Case 4. If p[op] = qr then we must have ω = ω_1ω_2 where
ω_1 = t_1...t_j = d_1γ_1...d_mγ_m matches q, with m≥0 and j≥0, and
ω_2 = t_{j+1}...t_k = d_{m+1}γ_{m+1}...d_nγ_n matches r, with n≥m and k≥j.
By the program we have (FIND rbp[op] (v ω t_{k+1}...) p[op])
= (FIND rbp[op] (FIND rbp[op] (v ω t_{k+1}...) q) r). We first apply our inductive
assertion to the nested expression, which is equivalent to
(FIND rbp[op] (v ω_1 t_{j+1}...) q). Conditions C1d' and C1f' about the internal
properties of ω_1 follow directly from C1d and C1f respectively. C2b' through C3a' deal
with the token t_{j+1}, so we must deal separately with the cases where ω_2≠Λ and ω_2=Λ. If
ω_2≠Λ, then t_{j+1} must be a delimiter by Lemma 11, so lbp[t_{j+1}] = 0, satisfying C2b' and
C2a'. In this case t_{j+1} is also the delimiter d_{m+1}, so C3b' is satisfied by C1f. Finally,
we know that t_{j+1} ∈ first_r, and by R1 cont_q ∩ first_r = ∅, so C3a' is satisfied. If ω_2=Λ,
then t_{j+1}=t_{k+1}, and m=n. In this case, since e_m=e_n, C2b', C2a', and C3b' are satisfied
directly by C2b, C2a, and C3b. Finally, since t_{j+1} = t_{k+1} ∉ cont_{p[op]}(e_1,...e_n) by
C3a, Lemma 8b says that t_{j+1} ∉ cont_q(e_1,...e_m), satisfying C3a'. We have then by
induction that this nested expression is (v•repr[e_1,...e_m] t_{j+1}...), so we have
= (FIND rbp[op] (v•repr[e_1,...e_m] ω_2 t_{k+1}...) r). We again use the assertion
inductively. As before C1d' and C1f' are directly satisfied, but since the last part of ω_2 is
also the last part of ω, conditions C2b', C2a', and C3b' are also directly satisfied. Since
t_{k+1} ∉ cont_{p[op]}(e_1,...e_n), Lemma 8 says that t_{k+1} ∉ cont_r(e_{m+1},...e_n), satisfying
C3a'. We have finally,
= (v•repr[e_1,...e_m]•repr[e_{m+1},...e_n] t_{k+1}...)
= (v•repr[e_1,...e_n] t_{k+1}...)
Case 5. If p[op] = (q|r) then by the program we have
(FIND rbp[op] (v ω t_{k+1}...) p[op])
= (COND ((MEMBER t_1 first_q) (FIND rbp[op] (v ω t_{k+1}...) q))
        ((MEMBER t_1 first_r) (FIND rbp[op] (v ω t_{k+1}...) r))
        ((LAMBDA-P p[op]) (v ω t_{k+1}...)))
It must be that either ω≠Λ matches q, ω≠Λ matches r, or ω=Λ. In the first case we have
t_1 ∈ first_q, so the first test is true, and we get the value of
(FIND rbp[op] (v ω t_{k+1}...) q). All conditions for induction are satisfied
automatically except C3a'. From C3a we have that t_{k+1} ∉ cont_{p[op]}(e_1,...e_n), but
then Lemma 9b tells us that t_{k+1} ∉ cont_q(e_1,...e_n). By induction then, this returns
the correct value. In the second case, the first test will fail because t_1 ∈ first_r and R2
says that first_q ∩ first_r = ∅. The second will be true, and as above the correct result
will be returned. In the final case where ω=Λ, the first two tests must fail for the
following reason: C3a says that t_{k+1} = t_1 ∉ cont_{p[op]}(e_1,...e_n) = cont_{p[op]}(Λ), so
we know by Lemma 9a that t_1 can be in neither first_q nor first_r. By Lemma 12
(LAMBDA-P p[op]) will be true, so (v ω t_{k+1}...) is returned; since
repr[e_1,...e_n] = nil, we here too get
= (v•repr[e_1,...e_n] t_{k+1}...).
Case 6. If p[op] = (q)* then by the program (FIND rbp[op] (v ω t_{k+1}...) p[op])
= (COND ((MEMBER t_1 first_q)
         (FIND rbp[op] (FIND rbp[op] (v ω t_{k+1}...) q) p[op]))
        (T (v ω t_{k+1}...)))

By definition either ω=Λ or ω=ω_1...ω_r for r>0, where ω_i matches q for 1≤i≤r. If ω=Λ then
the first test must fail for the following reason: from C3a we have
t_1 = t_{k+1} ∉ cont_{p[op]}(e_1,...e_n) = cont_{p[op]}(Λ). But by Lemma 10a we know then that
t_1 ∉ first_q, so the correct result is returned. For r>0 we prove the assertion by
induction on r.

r=1. Then we have that ω = ω_1 = d_1γ_1...d_nγ_n matches q. By the program we have
(FIND rbp[op] (v ω_1 t_{k+1}...) p[op])

= (COND ((MEMBER t_1 first_q)
         (FIND rbp[op] (FIND rbp[op] (v ω_1 t_{k+1}...) q) p[op]))
        (T (v ω_1 t_{k+1}...)))

By our assumption the test t_1 ∈ first_q will be true. We first apply our induction
hypothesis on patterns to the nested expression (FIND rbp[op] (v ω_1 t_{k+1}...) q).
We know by assumption that ω_1 matches q. Conditions C1d' through C3b' are satisfied directly
by C1d through C3b respectively. From C3a we know that t_{k+1} ∉ cont_{p[op]}(e_1,...e_n).
By Lemma 10b we know that t_{k+1} ∉ cont_q(ω_1), where ω_1=ω, so C3a' is satisfied. We
then have

= (FIND rbp[op] (v•repr[e_1,...e_n] t_{k+1}...) p[op]). Now we know by C3a that
t_{k+1} ∉ cont_{p[op]}(e_1,...e_n), so we know by Lemma 10b that t_{k+1} ∉ first_q. The test is
false and the value of the program is

= (v•repr[e_1,...e_n] t_{k+1}...).

r>1. Then we have ω = ω_1ω_2...ω_r where ω_1 = t_1...t_j = d_1γ_1...d_mγ_m matches q,
with m>0 and j>0, and ω_2...ω_r = t_{j+1}...t_k = d_{m+1}γ_{m+1}...d_nγ_n matches p[op], with
n>m and k>j. By the program we have (FIND rbp[op] (v ω t_{k+1}...) p[op])
= (COND ((MEMBER t_1 first_q)
         (FIND rbp[op] (FIND rbp[op] (v ω t_{k+1}...) q) p[op]))
        (T (v ω t_{k+1}...)))
By our assumption the test t_1 ∈ first_q will be true. We first apply our induction
hypothesis on patterns to the string ω_1 in the nested expression
(FIND rbp[op] (v ω_1 ω_2...ω_r t_{k+1}...) q). Condition C1' is true by our assumption
about ω. C1d' and C1f' are true directly from C1d and C1f. By Lemma 5, t_{j+1}, the first
symbol in ω_2...ω_r, is a delimiter, and so lbp[t_{j+1}] = 0, satisfying C2b' and C2a'. We
also know that t_{j+1}=d_{m+1}, so C3b' is satisfied by C1f. Finally, since t_{j+1} ∈ first_q we
know by restriction R3 that t_{j+1} ∉ cont_q, satisfying C3a'. We have then the value

= (FIND rbp[op] (v•repr[e_1,...e_m] ω_2...ω_r t_{k+1}...) p[op]). We now apply the
induction on r to the string ω_2...ω_r. Conditions C1d' through C3a' are directly satisfied
by C1d through C3a respectively, so we have the value

= (v•repr[e_1,...e_m]•repr[e_{m+1},...e_n] t_{k+1}...) which is

= (v•repr[e_1,...e_n] t_{k+1}...). ∎
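
The six clauses that the proof of Theorem 1.1 walks through can be sketched in Python. This is a reading of the clause structure quoted in the proof, not the FIND program of Section IV.C: the tuple encoding of patterns, the (parsed, tokens) state pair, and the parse_expr callback standing in for the recursive call to PARSE are all assumptions made for the sketch.

```python
def lambda_p(p):
    """True iff the null string matches p (hypothetical tuple-encoded pattern)."""
    kind = p[0]
    if kind == "seq":
        return lambda_p(p[1]) and lambda_p(p[2])
    if kind == "alt":
        return lambda_p(p[1]) or lambda_p(p[2])
    return kind in ("lambda", "star")


def first(p):
    """Delimiters that can begin a match of p."""
    kind = p[0]
    if kind in ("tok", "arg"):
        return [p[1]]
    if kind == "seq":
        return first(p[1]) + (first(p[2]) if lambda_p(p[1]) else [])
    if kind == "alt":
        return first(p[1]) + first(p[2])
    if kind == "star":
        return first(p[1])
    return []  # the null pattern


def find(rbp, state, p, parse_expr):
    """Parse the annotation part described by p; state is (parsed, tokens)."""
    parsed, tokens = state
    kind = p[0]
    nxt = tokens[0] if tokens else None
    if kind == "lambda":                      # Case 1: nothing to consume
        return state
    if kind == "tok":                         # Case 2: a bare delimiter
        if nxt != p[1]:
            raise SyntaxError(f"expected {p[1]!r}, found {nxt!r}")
        return parsed + [[p[1]]], tokens[1:]
    if kind == "arg":                         # Case 3: delimiter plus expression
        if nxt != p[1]:
            raise SyntaxError(f"expected {p[1]!r}, found {nxt!r}")
        tree, rest = parse_expr(rbp, tokens[1:])
        return parsed + [[p[1], tree]], rest
    if kind == "seq":                         # Case 4: q then r
        return find(rbp, find(rbp, state, p[1], parse_expr), p[2], parse_expr)
    if kind == "alt":                         # Case 5: choose by first sets
        if nxt in first(p[1]):
            return find(rbp, state, p[1], parse_expr)
        if nxt in first(p[2]):
            return find(rbp, state, p[2], parse_expr)
        if lambda_p(p):
            return state
        raise SyntaxError(f"unexpected token {nxt!r}")
    if kind == "star":                        # Case 6: repeat while q can start
        if nxt in first(p[1]):
            return find(rbp, find(rbp, state, p[1], parse_expr), p, parse_expr)
        return state
    raise ValueError(f"unknown pattern kind: {kind!r}")


def parse_atom(rbp, tokens):
    """Stand-in for PARSE: treats the next token as a complete expression."""
    return [tokens[0]], tokens[1:]


arg_list = ("seq", ("star", ("arg", ",")), ("tok", ")"))
result = find(0, ([], [",", "a", ",", "b", ")", "x"]), arg_list, parse_atom)
print(result)
```

The parsed component has the shape ((d_1 repr[e_1]) ... (d_n repr[e_n])) described in the Notation, and the second component holds the tokens left for the caller, mirroring the state lists used in the theorems.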
Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,
(NILFIX op (ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)


Proof: From the program we have the expression
= (CONS (APPEND (op)
                (CAR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
        (CDR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
By Theorem 1.1 we know that the call to FIND returns (repr[e_1,...e_n] t_{k+1}...), so
= (CONS (APPEND (op) repr[e_1,...e_n]) (t_{k+1}...))
= (repr[e] t_{k+1}...). ∎


Theorem 1.3 (PREFIX): Given C1-C3 for some e. If op is defined PREFIX,
(PREFIX op (β ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)


Proof: From the program we have the expression
= (CONS (APPEND (op)
                (LIST (LIST 'RIGHT (CAR (PARSE rbp[op] (β ω t_{k+1}...)))))
                (CAR (FIND rbp[op]
                           (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1}...))))
                           p[op])))
        (CDR (FIND rbp[op]
                   (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1}...))))
                   p[op])))
Since β = Wp(e_right) and e_right∈E'p, we know that if we can show our five conditions hold
for e_right then we can apply Theorem 1.9 inductively in order to obtain
(PARSE rbp[op] (β ω t_{k+1}...)) = (repr[e_right] ω t_{k+1}...).
From C1a we obviously have C1' satisfied. C1c tells us that rbp[op] < l-index[e_right],
which immediately gives us C4'. For C2', C3', and C5' we must consider whether the
annotation part ω is the null string or not. If ω≠Λ, then the first token of ω is the
delimiter d_1, so C3' is satisfied by C1e. Since the left binding power of this delimiter is 0,
conditions C2' and C5' are also satisfied. If ω=Λ, then n=0 and we immediately get C3' from
C3b, C2' from C2b, and C5' from C2a. Thus, the value of the expression is
= (CONS (APPEND (op)
                ((right repr[e_right]))
                (CAR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
        (CDR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
By Theorem 1.1 we know that the value of the call to FIND is
(repr[e_1,...e_n] t_{k+1}...), so the value of the expression is
= (repr[e] t_{k+1}...). ∎


Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then
(NUL-TYPE (op β ω t_{k+1}...)) = (repr[e] t_{k+1}...)


Proof: We consider the two possible cases. If op is defined NILFIX then
(GET op 'NUL-TYP) = NILFIX by Axiom 1, so we have by the program and Axiom 1

= (NILFIX op (β ω t_{k+1}...) rbp[op] p[op]), which by Theorem 1.2 is

= (repr[e] t_{k+1}...). Similarly, if op is defined PREFIX the correct value is returned by
the program, Axiom 2, and Theorem 1.3. If there is no NUL-TYPE definition for op, then
the value is

= (NILFIX op (t_{k+1}...) 0 Λ), which is the default condition. In this case op is
assumed to be nilfix with no arguments and null pattern, so by Theorem 1.2 the correct
value is returned. ∎


Theorem 1.5 (POSTFIX): Given C1-C3 for some e. If op is defined POSTFIX,
(POSTFIX repr[e_left] op (ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)


Proof: By the program we have the value
= (CONS (APPEND (op)
                ((left repr[e_left]))
                (CAR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
        (CDR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
By Theorem 1.1, the call to FIND has the value (repr[e_1,...e_n] t_{k+1}...), so we have
the expression
= (CONS (APPEND (op)
                ((left repr[e_left]))
                repr[e_1,...e_n])
        (t_{k+1}...))
= (repr[e] t_{k+1}...). ∎


Theorem 1.6 (INFIX): Given C1-C3 for some e. If op is defined INFIX,
(INFIX repr[e_left] op (β ω t_{k+1}...) rbp[op] p[op]) = (repr[e] t_{k+1}...)


Proof: By the program we have the expression
= (CONS (APPEND (op)
                ((left repr[e_left]))
                (LIST (LIST 'RIGHT (CAR (PARSE rbp[op] (β ω t_{k+1}...)))))
                (CAR (FIND rbp[op]
                           (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1}...))))
                           p[op])))
        (CDR (FIND rbp[op]
                   (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1}...))))
                   p[op])))
We use Theorem 1.9 inductively on the expression (PARSE rbp[op] (β ω t_{k+1}...)) in
exactly the same manner as in the proof of Theorem 1.3 (PREFIX), yielding the value
(repr[e_right] ω t_{k+1}...). We have then the expression
= (CONS (APPEND (op)
                ((left repr[e_left]))
                ((right repr[e_right]))
                (CAR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
        (CDR (FIND rbp[op] (nil ω t_{k+1}...) p[op])))
By Theorem 1.1 we know that the call to FIND returns the correct value, giving us
= (CONS (APPEND (op)
                ((left repr[e_left]))
                ((right repr[e_right]))
                repr[e_1,...e_n])
        (t_{k+1}...))
= (repr[e] t_{k+1}...). ∎


Theorem 1.7 (LEF-TYPE): Given C1-C3 for some e. If op is defined POSTFIX or
INFIX, then
(LEF-TYPE (repr[e_left] op β ω t_{k+1}...)) = (repr[e] t_{k+1}...)


Proof: We consider the two possible cases. If op is defined POSTFIX then
(GET op 'LEF-TYP) = POSTFIX by Axiom 3, so we have by the program and Axiom 3
= (POSTFIX repr[e_left] op (ω t_{k+1}...) rbp[op] p[op]), which by Theorem 1.5 is
= (repr[e] t_{k+1}...). Similarly, if op is defined INFIX the correct value is returned by
the program, Axiom 4, and Theorem 1.6. ∎


Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then
(PARSE rbp (δ t_{k+1}...)) = (ASSOC rbp (repr[e] t_{k+1}...))


Proof: We consider the two possible cases: op is NUL-TYPE or LEF-TYPE.

Case 1. If op is defined NILFIX or PREFIX then we have α=Λ in δ = α op β ω, so t_1 = op.
By the program we have (PARSE rbp (op β ω t_{k+1}...))

= (ASSOC rbp (NUL-TYPE (op β ω t_{k+1}...))), which by Theorem 1.4 is

= (ASSOC rbp (repr[e] t_{k+1}...)).

Case 2. If op is defined POSTFIX or INFIX then α≠Λ. We apply Theorem 1.8 inductively to
the expression (PARSE rbp (α op β ω t_{k+1}...)). From C1a we have α = Wp(e_left)
where e_left ∈ E'p, satisfying C1'. From C1b we have r-index[e_left] ≥ lbp[op],
satisfying C2'. We do not allow LEF-TYPE operators to be used as delimiters, so since
only delimiters can occur in c-set[e_left], C3' is trivially satisfied. From C4 we have
rbp < l-index[e], and since l-index[e] = min[lbp[op], l-index[e_left]], we have
rbp < l-index[e_left], satisfying C4'. By induction, then, we have
(PARSE rbp (α op β ω t_{k+1}...))

= (ASSOC rbp (repr[e_left] op β ω t_{k+1}...)). The value of the call to ASSOC is a
conditional whose first test is (LESSP rbp lbp[op]). By the same argument we used to
satisfy C4' above, we have rbp < lbp[op], so the test is true and the result is

= (ASSOC rbp (LEF-TYPE (repr[e_left] op β ω t_{k+1}...))). By Theorem 1.7 this is

= (ASSOC rbp (repr[e] t_{k+1}...)). ∎


Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then
(PARSE rbp (δ t_{k+1}...)) = (repr[e] t_{k+1}...)
Proof: By Theorem 1.8 we have (PARSE rbp (δ t_{k+1}...))
= (ASSOC rbp (repr[e] t_{k+1}...)). The value of the call to ASSOC is a conditional
whose first test is (LESSP rbp lbp[t_{k+1}]). By C5 this is false, so the second clause
returns (repr[e] t_{k+1}...). ∎
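
The interplay of PARSE, ASSOC, NUL-TYPE, and LEF-TYPE that Theorems 1.4, 1.7, 1.8, and 1.9 describe can be sketched in Python for operators with null patterns. This is an illustration assembled from the calls quoted in the proofs, not the program of Section IV.C: the operator table, its binding powers, the END token, and the (tree, tokens) state pair are assumptions chosen for the example.

```python
END = "<end>"  # hypothetical terminator token; its left binding power is -1

# Hypothetical operator table (binding powers chosen only for this demo).
PREFIX = {"-": 30}                      # op -> rbp
INFIX = {"+": (10, 10), "*": (20, 20)}  # op -> (lbp, rbp)
POSTFIX = {"!": 40}                     # op -> lbp


def lbp(token):
    """Left binding power; tokens with no LEF-TYPE definition get 0."""
    if token == END:
        return -1
    if token in INFIX:
        return INFIX[token][0]
    if token in POSTFIX:
        return POSTFIX[token]
    return 0


def nul_type(tokens):
    """Parse an operator that takes no left argument (NILFIX or PREFIX)."""
    op, rest = tokens[0], tokens[1:]
    if op == END:
        raise SyntaxError("expression expected before end of input")
    if op in PREFIX:
        right, rest = parse(PREFIX[op], rest)
        return [op, ["right", right]], rest
    return [op], rest  # default: nilfix with no arguments


def lef_type(left, tokens):
    """Parse an operator that takes the tree `left` as its left argument."""
    op, rest = tokens[0], tokens[1:]
    if op in POSTFIX:
        return [op, ["left", left]], rest
    if op in INFIX:
        right, rest = parse(INFIX[op][1], rest)
        return [op, ["left", left], ["right", right]], rest
    raise SyntaxError(f"{op!r} cannot take a left argument")


def assoc(rbp, state):
    """Keep attaching left arguments while the next token binds tighter than rbp."""
    left, rest = state
    if rbp < lbp(rest[0]):
        return assoc(rbp, lef_type(left, rest))
    return state


def parse(rbp, tokens):
    return assoc(rbp, nul_type(tokens))


tree, rest = parse(-1, ["a", "+", "b", "*", "c", END])
print(tree, rest)
```

In the top-level call the test rbp < lbp fails exactly at the terminator, which plays the role of condition C5; inside the right argument of "+" the same test succeeds for "*", so the product is built before control returns.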


V.D PARSE Theorem II 


We complete this chapter with the proof of the second PARSE theorem stated in
Section V.A:


∀δ∈Σ* (Pp(δ) halts error-free ⇒ δ∈Sp)


where Σ* is the set of all strings of tokens and Sp is the defined language as described in
Section IV.B. The program PARSE is given as input a list of tokens; if it halts error-free then
that string must be the linear representation of a grammatical expression tree. Notice that our
work is simplified by the fact that we do not worry about the value returned by the
program; this leads us to adopt the following convention.


Notation: We write (...) = (...) to mean that the LISP expression on the left
evaluates error-free to the value on the right. The presence of LISP expressions whose
value need not be discussed will be indicated by (...).


We can now restate our theorem in terms of the program PARSE as follows.

PARSE Theorem II: If δ∈Σ* and if (PARSE -1 (δ ⊣)) = ((...) ⊣), then δ∈Sp.


Outline of Proof


The statement and proof of this theorem closely parallel those of the first PARSE
theorem. As before, our desired result is a corollary of the last in a series of nine subsidiary
theorems, which correspond (in this case precisely) to the subroutines of the program PARSE.
These theorems, however, are now in the converse form: whenever the subroutine returns a
value, certain properties are shown to be true about the input string. The proof is again by
simultaneous induction with the theorems proven sequentially at each level. Their
interdependence, including the inductive use of Theorem 2.4, is illustrated in Figure 3. The
essential difference between the two PARSE theorems is the domain of induction; in this case
we use induction on the length of strings in the set Σ*.

[Figure 3: dependency graph of the subroutine theorems, with the inductive use marked
"(Induction)".]

Figure 3. Interdependence of Theorems 2.1-2.9
Statement of Theorems 2.1 through 2.9


Since we are given that δ∈Σ*, we will assume that the input list to PARSE is the list
of tokens (t_1...t_s), for s≥1, with the convention that t_s=⊣. As in the proof of PARSE
Theorem I, we use an inductive generalization, Theorem 2.9, which makes use of
Conditions C1 through C5. In this case, however, C1 through C5 are the consequents of the
theorem. From C1 we have the desired result that t_1...t_k ∈ Sp.


Conditions: 


C1. e∈E'p and δ = α op β ω = Wp(e) = t_1...t_k

C1a. α = Wp(e_left) if e_left exists (Λ otherwise), β = Wp(e_right) if e_right exists
(Λ otherwise), and ω = d_1γ_1...d_nγ_n for n≥0, where γ_i = Wp(e_i) for 1≤i≤n when e_i is
non-null (Λ otherwise), and e_left, e_right, e_1,...e_n ∈ E'p when they exist and are non-null.

C1b. r-index[e_left] ≥ lbp[op] if e_left exists.

C1c. rbp[op] < l-index[e_right] if e_right exists.

C1d. rbp[op] < l-index[e_i] for 1≤i≤n, when e_i is non-null.

C1e. d_1 ∉ c-set[e_right] if e_right and d_1 exist.

C1f. d_i ∉ c-set[e_{i-1}] for 1<i≤n when e_{i-1} is non-null.


C2. r-index[e] ≥ lbp[t_{k+1}].
C2a. rbp[op] ≥ lbp[t_{k+1}] if e_n exists and is non-null.
C2b. r-index[e_n] ≥ lbp[t_{k+1}] if e_n exists and is non-null.


C3. t_{k+1} ∉ c-set[e].
C3a. t_{k+1} ∉ cont_{p[op]}(e_1,...e_n).
C3b. t_{k+1} ∉ c-set[e_n] if e_n exists.


C4. rbp < l-index[e].
C5. rbp ≥ lbp[t_{k+1}].


We state now the theorems in full. 
Theorem 2.1 (FIND): If (FIND rbp[op] (nil t_{j+1}...t_s) p[op]) = state for
1≤j<s then

(a) state = ((...) t_{k+1}...t_s) where j≤k<s

(b) t_{j+1}...t_k = ω = d_1γ_1...d_nγ_n, for n≥0 where γ_i = Wp(e_i) and e_i∈E'p for 1≤i≤n,
and ω matches p.

(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for t_{j+1}...t_k.


Theorem 2.2 (NILFIX): If t_1 = op is defined NILFIX and
(NILFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then


(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s


(b) C1, C2, and C3 hold for t_1 ... t_k.


Theorem 2.3 (PREFIX): If t_1 = op is defined PREFIX and
(PREFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then


(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s


(b) C1, C2, and C3 hold for t_1 ... t_k.


Theorem 2.4 (NUL-TYPE): If for 1 < s (NUL-TYPE (t_1 ... t_s)) = state then


(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s


(b) C1, C2, and C3 hold for t_1 ... t_k where t_1 is defined NILFIX or PREFIX.


Theorem 2.5 (POSTFIX): If t_{a+1} = op is defined POSTFIX, t_1 ... t_a = W_D(e_α) for
e_α ∈ E'_D, r-index[e_α] ≥ lbp[t_{a+1}], and
(POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s


(b) C1, C2, and C3 hold for t_1 ... t_k.


Theorem 2.6 (INFIX): If t_{a+1} = op is defined INFIX, t_1 ... t_a = W_D(e_α) for
e_α ∈ E'_D, r-index[e_α] ≥ lbp[t_{a+1}], and
(INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s


(b) C1, C2, and C3 hold for t_1 ... t_k.


Theorem 2.7 (LEF-TYPE): If t_1 ... t_a = W_D(e_α) for e_α ∈ E'_D,
r-index[e_α] ≥ lbp[t_{a+1}], and (LEF-TYPE (...) (t_{a+1} ... t_s)) = state where a < s
then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s


(b) C1, C2, and C3 hold for t_1 ... t_k where t_{a+1} is defined POSTFIX or INFIX.


Theorem 2.8 (ASSOC): If t_1 ... t_j satisfy C1, C2, C3, and C4, and
(ASSOC rbp ((...) t_{j+1} ... t_s)) = state where j < s then


(1) If rbp ≥ lbp[t_{j+1}], state = ((...) t_{j+1} ... t_s).
(2) If rbp < lbp[t_{j+1}], then


(a) state = (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s
(b) C1, C2, C3, and C4 hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX.


Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ... t_s)) = state for 1 < s then

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s

(b) C1 through C5 hold for t_1 ... t_k.

Since this proof is essentially concerned with error handling, we precede the basis
step with a preliminary lemma about the behavior of the PARSE program on trivially
invalid inputs. The theorem itself deals with list arguments to PARSE of length two or
more, and it is important to know that no value will be returned for shorter lists.

Lemma 2.1: (PARSE rbp (t_1 ... t_s)) returns an error if s < 2.

Proof: By the program (PARSE rbp (t_1 ... t_s))

= (ASSOC rbp (NUL-TYPE (t_1 ... t_s))), but NUL-TYPE immediately tests by evaluating
(CDDR (t_1 ... t_s)). If s < 2 this will cause an error. ∎


Proof of Theorems 2.1 through 2.9, Basis Step


For the basis step we assume that s = 2, so the input string is (t_1 t_2) = (t_1 ⊣).
Since Theorem 2.9 is the final result and is the only theorem to be used inductively, it is the
only essential part of the basis step proof. To prove Theorem 2.9 for the case s = 2 we will
also need Theorems 2.1, 2.2, and 2.4.
Theorem 2.1 (FIND): If (FIND rbp[op] (nil ⊣) p[op]) = state, then
(a) state = ((...) ⊣)
(b) Λ = ω matches p[op]


(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for Λ.


Proof: Since k < s, the second argument to FIND must be (nil ⊣). We prove then the
following assertion inductively over the definition of the pattern p[op]. If


(FIND rbp[op] (nil ⊣) p[op]) = state, then Λ matches p[op] and state = (nil ⊣). This
assertion implies that ω = Λ and so C1d and C1f are trivially satisfied. Since lbp[⊣] = -1,
C2 is satisfied, and C3 because ⊣ is not in the defined language.

Case 1. If p[op] = Λ then true immediately.

Cases 2,3. It cannot be that p[op] = "d" or "d" ~, because a value would only be returned
if d = ⊣ and we know that ⊣ is not part of the defined language.

Case 4. If p[op] = qr then the value must be
(FIND rbp[op] (FIND rbp[op] (nil ⊣) q) r). By two uses of pattern induction
we have that Λ matches q and Λ matches r, so Λ matches p[op], and the final result is (nil ⊣).

Case 5. If p[op] = (q|r) then, since ⊣ cannot be in either of first_q or first_r, it must be that
Λ matches p[op] and (nil ⊣) is returned.

Case 6. If p[op] = (q)* then, since ⊣ cannot be in first_q, the result (nil ⊣) is returned
and Λ matches p[op] by definition. ∎


Theorem 2.2 (NILFIX): If op is defined NILFIX and
(NILFIX op (⊣) rbp[op] p[op]) = state then
(a) state = ((...) ⊣)
(b) C1, C2, and C3 hold for t_1.


Proof: By the program (NILFIX op (⊣) rbp[op] p[op])

= (CONS (...) (CDR (FIND rbp[op] (nil ⊣) p[op]))). By Theorem 2.1 this is

= ((...) ⊣), and we know that Λ matches p[op]. Since op is defined NILFIX we have then that
δ = t_1 = op = W_D(e) for e ∈ E'_D, completing C1. We have already C2 and C3 from
Theorem 2.1. ∎


Theorem 2.4 (NUL-TYPE): If (NUL-TYPE (t_1 ⊣)) = state then

(a) state = ((...) ⊣)

(b) C1, C2, and C3 hold for t_1 where t_1 is defined NILFIX or PREFIX.

Proof: NUL-TYPE only returns a value in three cases.


Case 1. If t_1 = op is defined NILFIX then we have the value
(NILFIX op (⊣) rbp[op] p[op]) and we are done immediately by Theorem 2.2.

Case 2. If t_1 = op is defined PREFIX then we have the value
(PREFIX op (⊣) rbp[op] p[op]). But PREFIX evaluates the expression
(PARSE rbp[op] (⊣)) which by Lemma 2.1 causes an error.

Case 3. If t_1 = op is not defined, then it is assumed by default to be a variable or constant;
we have then the value (NILFIX op (⊣) 0 Λ) and we are again done by Theorem 2.2. ∎


Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ⊣)) = state then
(a) state = ((...) ⊣)
(b) C1 through C5 hold for t_1.

Proof: By the program (PARSE rbp (t_1 ⊣)) = (ASSOC rbp (NUL-TYPE (t_1 ⊣))). By
Theorem 2.4 we know that this is = (ASSOC rbp ((...) ⊣)) and that C1, C2, and C3
hold. Since we know that lbp[⊣] = -1, ASSOC returns the value ((...) ⊣), and C5 is
satisfied. Finally, since op has no left argument we have l-index[e] = ∞, satisfying
C4. ∎


Proof of Theorems 2.1 through 2.9, Induction Step 


We now assume that s>2 and that Theorem 2.9 holds for strings of length less 
than s. 


Theorem 2.1 (FIND): If (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]) = state for
1 ≤ j < s then


(a) state = ((...) t_{k+1} ... t_s) where j ≤ k < s


(b) t_{j+1} ... t_k = ω = d_1 γ_1 ... d_n γ_n, for n ≥ 0 where γ_i = W_D(e_i) and e_i ∈ E'_D
for 1 ≤ i ≤ n, and ω matches p[op].


(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for t_{j+1} ... t_k.


Proof: The proof is by induction over the definition of the pattern p[op]; the six possible
cases are handled separately by the six conditional clauses in the program.

Case 1. If p[op] = Λ then (FIND rbp[op] (nil t_{j+1} ... t_s) p[op])

= ((...) t_{j+1} ... t_s). In this case ω = Λ, which clearly matches p[op]. Only condition C3a
is relevant to this case, but since p = Λ, we have cont_{p[op]} = ∅.

Case 2. If p[op] = "d" then the program will only return a value if d = t_{j+1}. If it does, the
value is

= ((...) t_{j+2} ... t_s). It must be the case that j+1 < s, since j+1 = s would imply that
t_{j+1} = ⊣, which we know cannot match any delimiter in the defined language. Clearly
ω = t_{j+1} matches p[op], and since there is no e_i, the only relevant condition to satisfy is
again C3a. Since p = "d", we have cont_{p[op]} = ∅.

Case 3. If p[op] = "d" ~, then the program will only return a value if d = t_{j+1}. If it does,
the value is

= ((...) (CDR (PARSE rbp[op] (t_{j+2} ... t_s)))). We must have j+2 < s, since PARSE
returns an error otherwise by Lemma 2.1. We have then by an inductive use of Theorem
2.9 the value

= ((...) t_{k+1} ... t_s) where j+2 ≤ k < s. We also know that the following conditions hold
for γ_1 = t_{j+2} ... t_k: C1' γ_1 = W_D(e_1) for e_1 ∈ E'_D, C2' r-index[e_1] ≥ lbp[t_{k+1}],
C3' t_{k+1} ∉ c-set[e_1], C4' rbp[op] < l-index[e_1], and C5' rbp[op] ≥ lbp[t_{k+1}]. We
now show that the necessary conditions hold for t_{j+1} ... t_k. We have first that
t_{j+1} ... t_k = d γ_1, which clearly matches p[op]. C1d is satisfied directly by C4'. Condition
C1f does not apply, since n = 1. C2a is satisfied by C5' and C2b by C2'. C3b is satisfied
directly by C3' and C3a from the fact that cont_{p[op]} = ∅ in this case.


Case 4. If p[op] = qr, then the value of the program is

= (FIND rbp[op] (FIND rbp[op] ((...) t_{j+1} ... t_s) q) r). By pattern induction on
the innermost expression we have the value

= (FIND rbp[op] ((...) t_{h+1} ... t_s) r) for j ≤ h < s. We know that
t_{j+1} ... t_h = d_1 γ_1 ... d_m γ_m = ω_1, for 0 ≤ m, which matches q, and that all conditions
(call them C1d', C1f', etc.) hold for ω_1. By another use of pattern induction we have the value

= ((...) t_{k+1} ... t_s) for h ≤ k < s. We know that
t_{h+1} ... t_k = d_{m+1} γ_{m+1} ... d_n γ_n = ω_2, for m ≤ n,
which matches r, and that all conditions (call them C1d'', C1f'', etc.) hold for ω_2. We now
show that all conditions hold for t_{j+1} ... t_k. Clearly j ≤ k < s and t_{j+1} ... t_k = ω_1 ω_2
matches p[op]. C1d follows directly from C1d' and C1d''. C1f follows from C1f' and C1f'' with
one exception. We must show that d_{m+1} ∉ c-set[e_m]; i.e., that the first delimiter of ω_2 is
not in the c-set of the last argument of ω_1. This case is covered by C3b'. If ω_2 ≠ Λ then
C2a follows directly from C2a'', otherwise from C2a'. Similarly C2b follows from either
C2b'' or C2b'. We know from C3a'' that t_{k+1} ∉ cont_r (e_{m+1}, ..., e_n). If ω_2 ≠ Λ then by
Lemma 8a we know t_{k+1} ∉ cont_{p[op]} (e_1, ..., e_n). If ω_2 = Λ then t_{k+1} = t_{h+1} and we
also know that t_{h+1} ∉ cont_q (e_1, ..., e_m). But by Lemma 8b it is also true that
t_{k+1} ∉ cont_{p[op]} (e_1, ..., e_n), satisfying C3a. Finally, C3b follows from C3b'' when
ω_2 ≠ Λ, otherwise from C3b'.

Case 5. If p[op] = (q|r) then the program only returns a value in one of three cases. If
the first test is true, t_{j+1} ∈ first_q, then we have

= (FIND rbp[op] ((...) t_{j+1} ... t_s) q). By pattern induction this is

= ((...) t_{k+1} ... t_s), where all conditions except C3a are satisfied immediately. We know
from C3a' that t_{k+1} ∉ cont_q (e_1, ..., e_n). By Lemma 9b, since ω ≠ Λ in this case, we also
know t_{k+1} ∉ cont_{p[op]} (e_1, ..., e_n), satisfying C3a. If the first test is false and the second
test, t_{j+1} ∈ first_r, is true then we have the same situation. If the first two tests are false,
and the third is true, Λ matches p[op], then the result is

= ((...) t_{j+1} ... t_s), where ω = Λ. In this case the only relevant condition is C3a. Since
we know by the failure of the first two tests that t_{j+1} ∉ first_q and t_{j+1} ∉ first_r, we know by
Lemma 8 that t_{j+1} ∉ cont_{p[op]} (Λ).

Case 6. If p[op] = (q)* then the value of the program is a conditional whose test is
t_{j+1} ∈ first_q. If the test fails then the value is

= ((...) t_{j+1} ... t_s). If the test succeeds, then the value is a recursive call to FIND for
p[op], after another ω_i has been found to match q. We prove by induction on the
number of calls to FIND for p[op] made before returning. The hypothesis is that each
time there is a call of the form (FIND rbp[op] ((...) t_{h+1} ... t_s) p[op]), then all of
the conditions except C3a are true of the string t_{j+1} ... t_h. Thus, when the test finally
fails, we only need show that C3a is true to be done, but by Lemma 10 we know that if
t_{h+1} ∉ first_q, when the test fails, and if t_{h+1} ∉ c-set[e_n], which we know from the
induction hypothesis C3b', then t_{h+1} ∉ cont_{p[op]} (e_1, ..., e_n), satisfying C3a. We now
prove the hypothesis.

Basis: At the first call, we have j = h, or ω = Λ. Since p[op] = (q)*, we know that ω matches
p[op]. No other conditions are relevant to this case.

Induction: If all conditions except C3a are true of t_{j+1} ... t_h = ω_1 ... ω_{i-1} (call these
conditions C1d', C1f', etc.), and if the test t_{h+1} ∈ first_q is true, then we have
(FIND rbp[op] ((...) t_{h+1} ... t_s) p[op])

= (FIND rbp[op] (FIND rbp[op] ((...) t_{h+1} ... t_s) q) p[op]). We know by our
induction over patterns (the higher level induction in this theorem) that this is

= (FIND rbp[op] ((...) t_{k+1} ... t_s) p[op]), where the string t_{h+1} ... t_k = ω_i matches q.
We also know that all the conditions are true for this string (call these C1d'', C1f'', etc.).
We now show that all conditions are true for the whole string t_{j+1} ... t_k. Since
ω_1 ... ω_{i-1} matches p[op] and ω_i matches q, we know that t_{j+1} ... t_k = ω_1 ... ω_i matches
p[op]. C1d is satisfied directly by C1d' and C1d''. C1f is similarly satisfied by C1f' and C1f''
with one exception. We need to show that the first symbol of ω_i is not in the c-set of the last
argument in ω_1 ... ω_{i-1}, but this follows from C3b'. We recall that Lemma 6 says that Λ
cannot match q, so ω_i ≠ Λ. Then we have conditions C2a, C2b, and C3b following directly
from C2a'', C2b'', and C3b'' respectively. Thus all conditions except C3a are satisfied. ∎
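
The six clauses of FIND treated in Cases 1 through 6 can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the LISP program of this paper: the tuple encoding of patterns, the helper names nullable and first, the spelling "-|" for the end marker, and the stand-in argument parser atom are all hypothetical, and the state carries only the list of arguments found so far rather than the full tree.

```python
END = "-|"  # assumed spelling of the end marker

# Assumed pattern encoding (not the paper's):
#   ("empty",)          the empty pattern
#   ("delim", d)        "d"
#   ("delim_arg", d)    "d" ~
#   ("seq", q, r)       qr
#   ("alt", q, r)       (q|r)
#   ("star", q)         (q)*


def nullable(p):
    """True when the empty string matches pattern p."""
    kind = p[0]
    if kind in ("empty", "star"):
        return True
    if kind in ("delim", "delim_arg"):
        return False
    if kind == "seq":
        return nullable(p[1]) and nullable(p[2])
    return nullable(p[1]) or nullable(p[2])  # alt


def first(p):
    """Set of delimiters that can begin a match of p."""
    kind = p[0]
    if kind == "empty":
        return set()
    if kind in ("delim", "delim_arg"):
        return {p[1]}
    if kind == "seq":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if kind == "alt":
        return first(p[1]) | first(p[2])
    return first(p[1])  # star


def find(rbp, state, p, parse_arg):
    """Match p against the front of the token list in state = (args, tokens)."""
    args, tokens = state
    head = tokens[0] if tokens else None
    kind = p[0]
    if kind == "empty":                          # Case 1
        return state
    if kind in ("delim", "delim_arg"):           # Cases 2 and 3
        if head != p[1]:
            raise SyntaxError(f"expected delimiter {p[1]!r}")
        if kind == "delim":
            return args, tokens[1:]
        tree, rest = parse_arg(rbp, tokens[1:])
        return args + [tree], rest
    if kind == "seq":                            # Case 4
        return find(rbp, find(rbp, state, p[1], parse_arg), p[2], parse_arg)
    if kind == "alt":                            # Case 5: three tests in order
        if head in first(p[1]):
            return find(rbp, state, p[1], parse_arg)
        if head in first(p[2]):
            return find(rbp, state, p[2], parse_arg)
        if nullable(p):
            return state
        raise SyntaxError("no alternative matches")
    if head in first(p[1]):                      # Case 6: match q, then retry p
        return find(rbp, find(rbp, state, p[1], parse_arg), p, parse_arg)
    return state


def atom(rbp, tokens):
    """Stand-in for PARSE: one token is the argument; fails on short input."""
    if len(tokens) < 2:
        raise SyntaxError("argument expected")
    return tokens[0], tokens[1:]


# Hypothetical annotation pattern: "then" ~ ("else" ~ | empty)
THEN_ELSE = ("seq", ("delim_arg", "then"), ("alt", ("delim_arg", "else"), ("empty",)))
```

With these assumptions, find(0, ([], ["then", "x", "else", "y", END]), THEN_ELSE, atom) returns (["x", "y"], [END]), and the same call without the else part returns (["x"], [END]) through the third test of Case 5. Deciding each clause by a one-token test against a first set is what keeps the matcher free of backtracking.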


Theorem 2.2 (NILFIX): If t_1 = op is defined NILFIX and
(NILFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then


(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s
(b) C1, C2, and C3 hold for t_1 ... t_k.


Proof: By the program (NILFIX op (t_2 ... t_s) rbp[op] p[op])

= (CONS (...) (CDR (FIND rbp[op] (nil t_2 ... t_s) p[op]))). So by Theorem 2.1

= ((...) t_{k+1} ... t_s) where 1 ≤ k < s. Since the annotation part t_2 ... t_k matches p[op] by
the theorem and C1d and C1f hold, we have satisfied C1, because C1b, C1c, and C1e are
not relevant to the NILFIX case. Then t_1 ... t_k = W_D(e) for e ∈ E'_D. By Theorem 2.1 we
also have C2a, C2b, C3a, and C3b, which give us C2 and C3 for e. ∎


Theorem 2.3 (PREFIX): If t_1 = op is defined PREFIX and
(PREFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then


(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s
(b) C1, C2, and C3 hold for t_1 ... t_k.


Proof: By the program (PREFIX op (t_2 ... t_s) rbp[op] p[op])

= (CONS (...) (CDR (FIND rbp[op]
                         (CONS NIL (CDR (PARSE rbp[op] (t_2 ... t_s))))
                         p[op]))).
We first consider the expression (PARSE rbp[op] (t_2 ... t_s)). It must be the case that
s > 2, otherwise PARSE causes an error by Lemma 2.1. We have then by the inductive use
of Theorem 2.9 that the result is
= (CONS (...) (CDR (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]))), where we know the
following about t_2 ... t_j = β: C1' β = W_D(e_β) for e_β ∈ E'_D, C2' r-index[e_β] ≥ lbp[t_{j+1}],
C3' t_{j+1} ∉ c-set[e_β], C4' rbp[op] < l-index[e_β], and C5' rbp[op] ≥ lbp[t_{j+1}].
Finally, we know that j < s, so we apply Theorem 2.1 and get
= ((...) t_{k+1} ... t_s), where we know k < s and that C1d, C1f, C2a, C2b, C3a, and C3b
already hold for the expression t_{j+1} ... t_k. We satisfy the others as follows. We have now
t_1 ... t_k = op β ω where op is defined PREFIX, and annotation part ω matches p[op],
satisfying C1a. C1b is not relevant to this case. C1c is satisfied by C4'. C1e is only
relevant if ω ≠ Λ, in which case it is satisfied by C3'. If ω ≠ Λ then e_n is part of the
annotation ω and conditions C2 and C3 follow from C2a, C2b, C3a, and C3b obtained
from Theorem 2.1. If ω = Λ, then e_n = e_β. In this case C2 is satisfied by C5', and C3 is
satisfied by C3a and C3'. ∎


Theorem 2.4 (NUL-TYPE): If for 1 < s (NUL-TYPE (t_1 ... t_s)) = state then
(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s
(b) C1, C2, and C3 hold for t_1 ... t_k where t_1 is defined NILFIX or PREFIX.


Proof: NUL-TYPE returns a value in three possible cases.

Case 1. If t_1 = op is defined NILFIX then we have the value

= (NILFIX op (t_2 ... t_s) rbp[op] p[op]) and the result is immediate by Theorem 2.2.

Case 2. If t_1 = op is defined PREFIX then we have the value

= (PREFIX op (t_2 ... t_s) rbp[op] p[op]) and the result is immediate by Theorem 2.3.

Case 3. If t_1 is undefined, then it is assumed by default to be NILFIX with rbp = 0 and
p[op] = Λ. We have the value

= (NILFIX op (t_2 ... t_s) 0 Λ) and the result is immediate by Theorem 2.2. ∎


Theorem 2.5 (POSTFIX): If t_{a+1} = op is defined POSTFIX, t_1 ... t_a = W_D(e_α) for
e_α ∈ E'_D, r-index[e_α] ≥ lbp[t_{a+1}], and
(POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s
(b) C1, C2, and C3 hold for t_1 ... t_k.


Proof: By the program we have (POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op])

= (CONS (...) (CDR (FIND rbp[op] (nil t_{a+2} ... t_s) p[op]))). So by Theorem 2.1

= ((...) t_{k+1} ... t_s) for a+1 ≤ k < s, and ω = t_{a+2} ... t_k matches p[op] with C1d, C1f, C2a,
C2b, C3a, and C3b already true. Since t_1 ... t_a = α = W_D(e_α) by assumption we have
t_1 ... t_k = α op ω, satisfying C1a. C1b is satisfied by the given, and C1c and C1e are not
relevant to the POSTFIX case, so we have t_1 ... t_k = W_D(e) for e ∈ E'_D, satisfying C1. C2
and C3 now follow directly from C2a, C2b, C3a, and C3b. ∎


Theorem 2.6 (INFIX): If t_{a+1} = op is defined INFIX, t_1 ... t_a = W_D(e_α) for
e_α ∈ E'_D, r-index[e_α] ≥ lbp[t_{a+1}], and
(INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s
(b) C1, C2, and C3 hold for t_1 ... t_k.


Proof: By the program we have (INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op])
= (CONS (...) (CDR (FIND rbp[op]
                         (CONS NIL (CDR (PARSE rbp[op] (t_{a+2} ... t_s))))
                         p[op])))
Using the same argument as in Theorem 2.3, substituting (t_{a+2} ... t_s) for (t_2 ... t_s), we
apply Theorems 2.9 and 2.1 in order to get
= ((...) t_{k+1} ... t_s). Conditions C1-C3 are also satisfied for the same reasons as in
Theorem 2.3, with the exception of C1b which is no longer irrelevant but is satisfied by
the given. ∎


Theorem 2.7 (LEF-TYPE): If t_1 ... t_a = W_D(e_α) for e_α ∈ E'_D,
r-index[e_α] ≥ lbp[t_{a+1}], and (LEF-TYPE (...) (t_{a+1} ... t_s)) = state where a < s
then


(a) state = ((...) t_{k+1} ... t_s) where a < k < s
(b) C1, C2, and C3 hold for t_1 ... t_k where t_{a+1} is defined POSTFIX or INFIX.


Proof: It must be the case that a+1 < s, otherwise LEF-TYPE returns an error by checking
(CDDR (t_{a+1} ... t_s)), and LEF-TYPE only returns a value in the following two cases.

Case 1. If t_{a+1} = op is defined POSTFIX then we have the value

= (POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) and the result is immediate by
Theorem 2.5.

Case 2. If t_{a+1} = op is defined INFIX then we have the value

= (INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) and the result is immediate by
Theorem 2.6. ∎


Theorem 2.8 (ASSOC): If t_1 ... t_j satisfy C1, C2, C3, and C4, and
(ASSOC rbp ((...) t_{j+1} ... t_s)) = state where j < s then


(1) If rbp ≥ lbp[t_{j+1}], state = ((...) t_{j+1} ... t_s).
(2) If rbp < lbp[t_{j+1}], then


(a) state = (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s
(b) C1, C2, C3, and C4 hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX.


Proof: The program is a conditional which tests (LESSP rbp lbp[t_{j+1}]). If the test is
false then we have part 1. If true we have

= (ASSOC rbp (LEF-TYPE ((...) t_{j+1} ... t_s))). From the given we know C1', C2', C3',
and C4' for t_1 ... t_j. By C1' and C2' the conditions for Theorem 2.7 are satisfied so we
have

= (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s. We know further that C1, C2, and C3
hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX. C4 holds since we have
rbp < lbp[t_{j+1}] by assumption and C4'. ∎

Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ... t_s)) = state for 1 < s then
(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s
(b) C1 through C5 hold for t_1 ... t_k.


Proof: By the program (PARSE rbp (t_1 ... t_s))

= (ASSOC rbp (NUL-TYPE (t_1 ... t_s))). By Theorem 2.4 we know that this is

= (ASSOC rbp ((...) t_{j+1} ... t_s)) where 1 ≤ j < s, and C1, C2, and C3 hold for t_1 ... t_j.
Since t_1 is defined NILFIX or PREFIX we know by definition that l-index[e] = ∞, so
C4 is also satisfied. We know by Theorem 2.8 that ASSOC either returns when
rbp ≥ lbp[t_{j+1}] or calls itself recursively with conditions C1, C2, C3, and C4 still
satisfied. By induction, when ASSOC does halt, C1, C2, C3, and C4 still hold. In
addition condition C5 is satisfied by the failure of the test. Clearly ASSOC must
eventually halt, since at each call we know j < k < s; i.e., every call removes more symbols
from the input stream. ∎
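
The control structure whose termination is argued above can be sketched in a few lines of Python. The fragment is an illustration only, not the LISP program proved correct in this paper: it omits annotation patterns (FIND), writes ASSOC as a loop rather than a recursion, and the operator table DEFS, the spelling "-|" for the end marker, and the tuple representation of trees are assumptions made for the example.

```python
END = "-|"  # assumed spelling of the end marker; its lbp is -1 as in the text

# Hypothetical toy definition: token -> (kind, lbp, rbp).
DEFS = {
    "+": ("INFIX", 10, 10),
    "*": ("INFIX", 20, 20),
    "^": ("INFIX", 30, 29),    # rbp below lbp gives right association
    "-": ("PREFIX", 0, 25),
    "!": ("POSTFIX", 40, 0),
}
NILFIX_DEFAULT = ("NILFIX", 0, 0)  # undefined tokens act as variables or constants


def lbp(tok):
    """Left binding power: -1 for the end marker, 0 unless INFIX or POSTFIX."""
    if tok == END:
        return -1
    kind, left, _ = DEFS.get(tok, NILFIX_DEFAULT)
    return left if kind in ("INFIX", "POSTFIX") else 0


def nul_type(tokens):
    """Parse a token that takes no left argument (NILFIX or PREFIX)."""
    if len(tokens) < 2 or tokens[0] == END:
        # Mirrors Lemma 2.1: no value is returned for lists shorter than two.
        raise SyntaxError("expression expected")
    op, rest = tokens[0], tokens[1:]
    kind, _, rbp = DEFS.get(op, NILFIX_DEFAULT)
    if kind == "NILFIX":
        return op, rest
    if kind == "PREFIX":
        arg, rest = parse(rbp, rest)
        return (op, arg), rest
    raise SyntaxError(f"{op!r} needs a left argument")


def lef_type(left, tokens):
    """Parse a token that takes `left` as its left argument (POSTFIX or INFIX)."""
    if len(tokens) < 2:
        raise SyntaxError("input too short")
    op, rest = tokens[0], tokens[1:]
    kind, _, rbp = DEFS.get(op, NILFIX_DEFAULT)
    if kind == "POSTFIX":
        return (op, left), rest
    if kind == "INFIX":
        right, rest = parse(rbp, rest)
        return (op, left, right), rest
    raise SyntaxError(f"{op!r} cannot follow an expression")


def assoc(rbp, left, tokens):
    """Extend `left` while the next token binds more tightly than rbp."""
    while rbp < lbp(tokens[0]):
        # Each pass consumes at least one token, so the loop must halt.
        left, tokens = lef_type(left, tokens)
    return left, tokens


def parse(rbp, tokens):
    """Return (tree, remaining tokens), the analogue of the state in the text."""
    left, rest = nul_type(tokens)
    return assoc(rbp, left, rest)
```

On the assumed table, parse(0, ["a", "+", "b", "*", "c", END]) yields the tree ("+", "a", ("*", "b", "c")) with remaining input [END], and parse(0, ["x", END]) reproduces the basis case by returning ("x", [END]). When assoc returns, the test rbp < lbp has just failed, which is the situation described by condition C5.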


VI. CONCLUSIONS 


VI.A Summary


We began with the observation that BNF is not effective as a practical 
meta-language for programming language designers, implementers, and users. We used 
Pratt’s CGOL technique for translator construction, and specified a meta-language which 
avoids many of the difficulties inherent in BNF approaches. Its essential feature is an 
expressive power which is very closely related to the actual parsing technique of the 
translator: we can conveniently describe exactly those languages which the translator 
technique handles well. An immediate consequence is freedom from the awkward 
restrictions inherent in most automatic translator construction systems. 

We have demonstrated these advantages by presenting the design of a CGOL based 
parsing program; although the meta-language is based on Pratt's informal syntactic
guidelines, we have demonstrated with a formal correctness proof that none of the rigor of 
more traditional approaches has been sacrificed. The first part of this proof deals 
exclusively with properties of the meta-language; these results permit a very straightforward 
program proof, and may be applied equally well to proofs of other implementations. 


VI.B Further Work 


The use of nonstandard syntactic descriptions is an open area for research. The 
example presented in this paper treats a class of languages appropriate to the CGOL 
technique; it should be feasible to apply the same approach in other, perhaps more 
specialized, contexts. Even within the CGOL system there are a number of issues which
need more thought. For example, the meta-language presented in this paper uses regular
expressions to specify multiple right arguments. More than half of the proof is devoted to 
patterns, and the parser for them is the one long program in the system. The generality of 
regular expressions may not be worth the effort involved. Other unresolved issues deal 
with delimiters, e.g. it is not absolutely necessary that they have left binding powers of zero. 
This convention was imposed for simplicity. 


There are also a number of unfinished implementation issues. The LISP 
implementation of the parser is much longer and less efficient than necessary but could be 
immediately improved by the use of global variables and side effects. The actual parser 
should be as short as most of the definitions for CGOL given in [Pratt 1974]. In addition,
an actual implementation of the meta-language processor is desirable. This could take the 
form of an interactive definitional facility, providing the designer with on-line assistance, 
such as production debugging, and with incremental implementation, e.g. for bootstrapping.


SUMMARY OF NOTATION


p, q, r         are CGOL annotation patterns
~               is a metasymbol used in productions to denote the presence of an argument
D               is a language definition (a set of productions)
e               is an expression tree
E_D             is the set of expression trees corresponding to a definition D
E'_D            is the set of grammatical expression trees corresponding to a definition D
op              is an operator
t               is a token, a lexeme
d               is a delimiting token
α, β, γ, δ, ω   are strings of tokens
Λ               is the empty string
S_D             is a set of strings, over some alphabet of tokens, corresponding to a definition D
W_D             is a writing function defined on E_D with values in S_D
P_D             is the parse function corresponding to a definition D